Introduction

Next Generation Sequencing (NGS) technology has
revolutionized genetic studies through RNA-Seq for
transcriptome analysis and ChIP-Seq for DNA-protein
interaction [1]. ChIP-Seq has become the method of
choice for genome-wide characterization of transcription
factor binding, polymerase binding, and histone
modifications [2]. Identifying the binding sites of
transcription factors, polymerases, or histone modification
marks plays a crucial role in identifying the regulatory
elements that control gene expression. Several peak
calling tools have been developed for detecting binding
sites in ChIP-Seq data. These tools share a common method:
they compute the density of read counts and report regions
of high density, called peaks. Peak calling tools output
lists of peak sequences in various sizes and formats.
The actual binding sites are often short sequences
embedded in these peak sequences. The actual DNA region
that interacts with a single transcription factor
(TF) typically ranges from 8-10 to 16-20 bp [2]. In
addition, TF binding sites in ChIP-Seq are usually
located close to the summit points of the
peaks [3]. Zambelli et al. describe how TFs bind
DNA in a sequence-specific way: they recognize
sequences that are similar but not identical, differing
from one another by only a few nucleotides [2]. Thus,
identifying the conserved motifs in these sequences reveals
that the same TF binds to them. Motif finding is one of the
well-known problems in bioinformatics, and many tools have
been developed for it. Recent motif finding tools
provide user-friendliness via Web interfaces.
In this work, we surveyed nine motif finding Web
tools that are capable of finding motifs in ChIP-Seq
data. These tools are listed in Table 1.

Review 

General approaches for motif finding 

Table 1 A summary of motif finding web tools

| Web Tool | Pipeline | Accept File Format | Maximum File Size | Maximum Sequence Length | P-Value Option | Motifs Size Option | # of Motifs Option | Ref. Database |
|---|---|---|---|---|---|---|---|---|
| MEME | No | Fasta | 60000 characters | < 1000 bp | No | Yes | Yes | JASPAR, BLOCKS, UniProbe, user database |
| GLAM2 | No | Fasta | 60000 characters | ≤ 10000 bp | No | No | No | JASPAR, UniProbe, user database |
| CisFinder | No | Fasta, plain text delimited | Unspecified | ≤ 50 Mb | FDR option | No | Yes | JASPAR, CisView, user database |
| W-ChIPMotifs | Yes | Fasta | Unspecified | Unspecified | No | No | No | JASPAR, TRANSFAC, user database |
| CompleteMOTIFs | Yes | Bed, fasta, gff | ≤ 500000 bp for MEME, Weeder; ≤ 5000000 bp for ChIPMunk | Unspecified | Yes | Yes for MEME | No | JASPAR, TRANSFAC |
| DREME | No | Fasta | Unspecified | Unspecified | E-value option | No | No | JASPAR, UniProbe, user database |
| MEME-ChIP | Yes | Fasta | Unlimited | Unlimited | E-value option | Yes | Yes | JASPAR, UniProbe, user database |
| RSAT peak-motifs | Yes | Raw, multi, tab, fasta, wconsensus, IG | Unlimited | Unlimited | No | Yes | Yes | JASPAR, UniProbe, DMMPMM, RegulonDB, user database |
| PScanChIP | No | Bed | Unlimited | 100-150 bp | No | No | No | JASPAR, TRANSFAC |

| Web Tool | Approach | Ref. Database Option | Ref. Genome Option | Log in Required | Email Required | User Account Option | Published Year | Current Version | Ref. # |
|---|---|---|---|---|---|---|---|---|---|
| MEME | Implemented Multiple EM | No | No | No | Yes | No | 2006 | 4.9.1 | [4] |
| GLAM2 | Implemented novel Gapped Local Alignment of Motifs algorithm | No | No | No | Yes | No | 2008 | 4.9.1 | [5] |
| CisFinder | Implemented novel CisFinder algorithm | Yes | No | Optional | Optional | Optional | 2009 | Unspecified | [6] |
| W-ChIPMotifs | Used existing ChIPMotifs program and incorporated other existing tools: MEME, MaMF, and Weeder | No | Human and Mouse only | No | Yes | No | 2009 | Unspecified | [7] |
| CompleteMOTIFs | Integrated existing tools: MEME, Weeder, and ChIPMunk | Yes | Yes | Optional | Optional | Optional | 2011 | Unspecified | [8] |
| DREME | Implemented novel Discriminative Regular Expression Motif Elicitation algorithm (DREME) | No | No | No | Yes | No | 2011 | 4.9.1 | [9] |
| MEME-ChIP | Integrated existing tools: MEME and DREME | No | No | No | Yes | No | 2011 | 4.9.1 | [10] |
| RSAT peak-motifs | Implemented RSAT oligo-analysis, RSAT dyad-analysis, RSAT local-word analysis, MEME, ChIPMunk | Yes | No | No | Optional | No | 2012 | Unspecified | [11] |
| PScanChIP | Used existing Pscan algorithm | Yes | Yes | No | No | No | 2013 | 1.0 | [3] |

Motifs are short sequences of a similar pattern found in
DNA or protein sequences. Consider t input nucleotide
sequences of length n and an array
$s = (s_1, s_2, s_3, \ldots, s_t)$ of starting positions,
with one position taken from each sequence. An alignment
matrix is a $t \times l$ matrix whose rows are the t
subsequences of length l that begin at these starting
positions, one from each sequence, where l is the size of
an l-mer. A profile matrix is a $4 \times l$ matrix
containing 4 rows for the four nucleotides (A, C, G, T)
and l columns. Each entry in the profile matrix is the
frequency of a nucleotide in the corresponding column of
the alignment matrix. The consensus score is the sum
of the highest frequencies from each column of the profile
matrix. The motif finding problem can be stated simply
as follows. Given t input nucleotide sequences of length
n, we want to find a set of l-mers, one from each
sequence, that maximizes the consensus score.
Thus, we need to consider all $(n - l + 1)^t$ possible
combinations of starting positions, or candidates for
motifs. That is, the number of candidates is exponential
in the number of input sequences. In fact, motif finding
is an NP-complete problem [12]. There are several
different approaches for finding motifs, such as profiles,
consensuses, projection, graph representations,
clustering, and tree-based approaches [2,13,14].
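
The definitions above can be illustrated with a minimal Python sketch. The function names and the toy sequences are our own illustrative assumptions and do not come from any of the surveyed tools; the profile here stores raw counts as the frequencies.

```python
NUCLEOTIDES = "ACGT"

def alignment_matrix(sequences, starts, l):
    """Return the t l-mers selected by one starting position per sequence."""
    return [seq[s:s + l] for seq, s in zip(sequences, starts)]

def profile_matrix(alignment):
    """Return a 4 x l profile as a dict: nucleotide -> per-column counts."""
    l = len(alignment[0])
    profile = {nt: [0] * l for nt in NUCLEOTIDES}
    for lmer in alignment:
        for j, nt in enumerate(lmer):
            profile[nt][j] += 1
    return profile

def consensus_score(profile):
    """Sum of the highest count in each column of the profile."""
    l = len(profile["A"])
    return sum(max(profile[nt][j] for nt in NUCLEOTIDES) for j in range(l))

def num_candidates(t, n, l):
    """Number of ways to pick one starting position in each of t sequences."""
    return (n - l + 1) ** t

# Toy input (hypothetical): t = 3 sequences of length n = 8, l = 4.
sequences = ["TTACGTAA", "GACGTCCC", "CCCACGAT"]
starts = [2, 1, 3]
alignment = alignment_matrix(sequences, starts, 4)
score = consensus_score(profile_matrix(alignment))
```

Even for this toy input an exhaustive search would have to score `num_candidates(3, 8, 4)` alignments, which shows how quickly the search space grows with the number of sequences.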

Profiles 

This approach uses a Position Weight Matrix (PWM)
to represent the frequency of the four possible
nucleotides at each position of the motif [13]. The
PWM is a $4 \times l$ matrix containing 4 rows for the four
nucleotides (A, C, G, T) and l columns, where l is the size
of the motif. Using a PWM, the most likely location of
the motif within each sequence can be calculated [13].

Some examples of profiles-based algorithms can be 
found in [15-19]. 
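
As a rough illustration of how a PWM can locate the most likely motif position, the sketch below scores every window of a sequence and keeps the best one. The log-odds scoring, the pseudocount, and the uniform background of 0.25 are our own assumptions for the example, not details taken from [13].

```python
import math

def log_odds_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per column) from aligned sites."""
    l = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for j in range(l):
        column = [site[j] for site in sites]
        pwm.append({
            nt: math.log2(((column.count(nt) + pseudocount) / total) / background)
            for nt in "ACGT"
        })
    return pwm

def best_location(sequence, pwm):
    """Return (start, score) of the highest-scoring window in the sequence."""
    l = len(pwm)
    scores = [
        sum(pwm[j][sequence[i + j]] for j in range(l))
        for i in range(len(sequence) - l + 1)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical aligned sites and a sequence to scan.
pwm = log_odds_pwm(["ACGT", "ACGT", "ACGA"])
start, score = best_location("TTTACGTTT", pwm)
```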

Consensuses 

In this approach, a consensus string is formed for each 
profile, which is constructed for each of the possible sets 
of starting locations in the alignment of the sequences. 
The best consensus with highest score is chosen to de- 
scribe the motifs in the sequences [13]. 

Some examples of consensus-based algorithms are 
WINNOWER [20], CONSENSUS [21], and ProfileB- 
ranching [22]. 

Projection 

This approach solves the $(l, k)$ motif problem, where each
instance of a motif of length l differs from the original
motif by exactly k positions. These k positions are used as
hashing functions for all possible contiguous sequences of
l nucleotides. The potential motif sequences are put into
buckets based on their hash values. If the number of l-mers
that hash to the same bucket exceeds a threshold, they are
considered good candidates for motifs. The algorithm
then searches these buckets for the motif candidates [13].

Some examples of projection-based algorithms are 
PROJECTION [23] and Uniform Projection Motif Finder 
(UPMF) [24]. 
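
The bucketing step can be sketched as follows. This is a simplified illustration under our own assumptions: the hash key is formed by reading each l-mer at a fixed tuple of positions (the `positions` argument is hypothetical, and published tools such as PROJECTION choose the positions at random and repeat the procedure).

```python
from collections import defaultdict

def project(lmer, positions):
    """Hash an l-mer by keeping only the letters at the chosen positions."""
    return "".join(lmer[p] for p in positions)

def bucket_lmers(sequences, l, positions):
    """Place every l-mer of every sequence into the bucket of its hash key."""
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            key = project(seq[start:start + l], positions)
            buckets[key].append((seq_index, start))
    return buckets

def enriched_buckets(buckets, threshold):
    """Keep buckets whose number of l-mers exceeds the threshold."""
    return {key: hits for key, hits in buckets.items() if len(hits) > threshold}

# Hypothetical input: three sequences carrying ACGT, ACCT and ACAT,
# which agree everywhere except at position 2 of the 4-mer.
sequences = ["TTACGTAA", "GGACCTGG", "CCACATCC"]
buckets = bucket_lmers(sequences, l=4, positions=(0, 1, 3))
candidates = enriched_buckets(buckets, threshold=2)
```

Because the mutated position is not part of the projection, all three variants fall into the same bucket and are reported as candidates for refinement.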



Graph representations

This approach recasts the motif finding problem as a
graph problem in which nodes correspond to
substrings of the input sequences and edges connect
nodes that correspond to similar substrings [2]. Thus, the
motifs can be found by detecting cliques [25] or
maximum density sub-graphs [26].
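
A brute-force sketch of the clique formulation is given below. It assumes, for illustration only, that two substrings are similar when their Hamming distance is at most `max_dist` and that a motif corresponds to a clique with exactly one node per sequence; the algorithms in [25,26] use far more efficient strategies.

```python
from itertools import combinations, product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def motif_cliques(sequences, l, max_dist):
    """Return all cliques with one l-mer node (seq_index, start) per sequence."""
    nodes = [
        [(i, s) for s in range(len(seq) - l + 1)]
        for i, seq in enumerate(sequences)
    ]

    def word(node):
        i, s = node
        return sequences[i][s:s + l]

    cliques = []
    for choice in product(*nodes):
        # A clique requires an edge between every pair of chosen nodes.
        if all(hamming(word(u), word(v)) <= max_dist
               for u, v in combinations(choice, 2)):
            cliques.append(choice)
    return cliques

# Hypothetical input with planted variants ACGT, ACCT and ACAT.
sequences = ["TTACGTAA", "GGACCTGG", "CCACATCC"]
cliques = motif_cliques(sequences, l=4, max_dist=1)
```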

Clustering 

Motif finding can be transformed into a clustering
problem in which the substrings of the input sequences
that form the motif should be clustered together, and the
rest should belong to a background cluster [2]. Thus,
the clusters can be found using appropriate clustering
strategies such as self-organizing maps [27,28].

Tree-based 

This approach models motif finding with a tree-based
data structure and uses tree-based algorithms to solve
the motif detection. Al-Turaiki et al. modelled the motif
finding problem using a trie data structure and
transformed motif finding into mining frequent patterns
in large datasets [14]. Mohapatra et al. represented the
input with a generalized suffix tree and developed
a tree-based algorithm for finding motifs [29].

Motif finding Web tools 

General features of motif finding Web tools 

The implementation of motif finding Web tools gener- 
ally falls into two categories. The first category is pipe- 
line implementation, which incorporates existing tools 
into a Web tool/Web service. The second category in- 
volves implementing novel algorithms into a Web tool/ 
Web service. Generally, motif finding Web tools allow 
uploading input sequences of DNA, protein, or binding 
sites. The users can customize the motif finding strategy 
before submitting the request. The results can be dis- 
played on the browser or can be downloaded. However, 
different motif finding Web tools provide different cus- 
tomizations for finding motifs as well as provide differ- 
ent result formats. Some Web tools have restrictions on 
the size of input sequence, the number of peaks, or the 
size of upload file. Others provide flexibility for input file 
formats and allow creating an account for storing the re- 
sults on the server. Some Web tools require email ad- 
dress for notifying the result. All motif finding Web 
tools have their own features for verifying discovered 
motifs with one or more motif reference databases such 
as JASPAR [30], TRANSFAC [31], CisView [32], UniP- 
robe [33], and user's reference. Some Web tools allow 
selecting one or more motif reference databases while 
others use their own pre-selected references. Some Web 
tools provide options for selecting the reference genome, 
motif size, and the number of motifs to return. 
In the following section, we observe the features,
approach, strengths and weaknesses of each motif finding
Web tool.

MEME 

MEME [4] (Multiple EM for Motif Elicitation) is a Web 
service available on MEME suite [34]. MEME allows 
running motif detection on its Website or through sev- 
eral mirror sites. It can be downloaded and installed lo- 
cally. MEME is a de novo motif finding tool, which was 
designed for finding un-gapped motifs in unaligned 
DNA or protein sequences. MEME only accepts < 60,000 
characters in the input file, which must be in fasta for- 
mat. The input sequence's length should be < 1,000 bp 
and as short as possible. MEME suggests removing du- 
plicate sequences and sequences with low information 
that may not contain the motif prior to running the 
motif finding. MEME allows specifying the length of the 
motif and the number of motifs to return. It also allows 
entering the number of sites for each motif if there is a 
prior knowledge about the number of occurrences that 
the motif has in the dataset. MEME requires specifying 
how the user believes the occurrences of the motifs are 
distributed among the sequences, for example, zero or 
one per sequence. MEME includes the option in the re- 
sults on the browser for verifying discovered motifs with 
the reference database. Its initial version allowed verify- 
ing discovered motifs with JASPAR [30] or BLOCKS 
[35] reference database. In its later versions, MEME 
allows using TOMTOM [36] for verifying discovered 
motifs. MEME requires email address for notifying the 
results. It does not allow either creating an account or 
storing the results on the server. MEME includes other 
options such as performing discriminative motif discovery, 
uploading file containing a background Markov model, 
searching a given strand or both given strand and reverse 
strand, and looking for palindromes [4]. 

A summary of MEME's features can be found in 
Table 1. 

MEME provides three different output formats: HTML, 
XML, and text. The output shows the motifs as local mul- 
tiple alignments of the input sequences. It allows sending 
motifs to MAST [37] Web server for searching the se- 
quences that match discovered motifs. MEME also pro- 
vides other options in HTML output for forwarding one 
or all motifs to other Web-based programs for further 
analysis. For each motif, MEME outputs E-value, number 
of sites found, motif's logo, motif's blocks format, motif's 
block diagrams, position-specific scoring matrix, position- 
specific probability matrix, and so on [4]. 

The MEME algorithm extends the expectation maximization
(EM) algorithm [38]. The EM algorithm for motif finding
presented by Lawrence et al. has the following drawbacks.
It is not clear how to select the starting point or when to
stop trying different starting points. It also assumes that
the shared motif appears exactly once in each sequence of
the dataset, which is not always the case. The MEME
algorithm overcomes these limitations. MEME selects
starting points based on all subsequences of the sequences
in the training dataset. It also eliminates the assumption
that the shared motif appears in each sequence. MEME
removes the appearances of a motif after it is discovered
and keeps searching for additional shared motifs in the
dataset [38].
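
To make the role of the starting points concrete, the following sketch runs a basic EM for motif finding from every subsequence of the input and keeps the best result. It is a simplified illustration under our own assumptions: it retains the one-occurrence-per-sequence model of the original EM, uses a uniform background, and seeds each run by giving weight 0.5 to the letters of the starting subsequence. It is not MEME's implementation.

```python
import math

ALPHABET = "ACGT"

def initial_pwm(word, weight=0.5):
    """Seed a probability matrix from one subsequence (assumed weighting)."""
    rest = (1.0 - weight) / 3
    return [{nt: (weight if nt == ch else rest) for nt in ALPHABET} for ch in word]

def window_ratio(window, pwm, background=0.25):
    """Likelihood ratio of a window under the motif vs. uniform background."""
    ratio = 1.0
    for j, nt in enumerate(window):
        ratio *= pwm[j][nt] / background
    return ratio

def em_step(sequences, pwm, pseudocount=0.1):
    """One E-step and M-step; returns the new matrix and the log-likelihood."""
    w = len(pwm)
    counts = [{nt: pseudocount for nt in ALPHABET} for _ in range(w)]
    log_likelihood = 0.0
    for seq in sequences:
        ratios = [window_ratio(seq[i:i + w], pwm) for i in range(len(seq) - w + 1)]
        total = sum(ratios)
        log_likelihood += math.log(total / len(ratios))
        for i, r in enumerate(ratios):
            z = r / total  # E-step: posterior that the site starts at i
            for j in range(w):
                counts[j][seq[i + j]] += z
    # M-step: re-estimate column probabilities from the expected counts.
    new_pwm = [{nt: col[nt] / sum(col.values()) for nt in ALPHABET} for col in counts]
    return new_pwm, log_likelihood

def em_from_all_starts(sequences, w, iterations=5):
    """Run EM from every subsequence of width w and keep the best matrix."""
    best = None
    for seq in sequences:
        for i in range(len(seq) - w + 1):
            pwm = initial_pwm(seq[i:i + w])
            for _ in range(iterations):
                pwm, ll = em_step(sequences, pwm)
            if best is None or ll > best[0]:
                best = (ll, pwm)
    return best[1]

def consensus(pwm):
    """Most probable letter in each column."""
    return "".join(max(col, key=col.get) for col in pwm)

# Hypothetical sequences sharing the planted 6-mer ACGTGG.
sequences = ["TTTTACGTGGTT", "GGACGTGGCCCC", "CCCCCACGTGGA", "ATATACGTGGAT"]
motif = consensus(em_from_all_starts(sequences, w=6))
```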

Because MEME erases previously discovered motifs when
it searches for new motifs, MEME can only model a single
motif at a time and does not detect alternative binding
motifs, which are motifs for co-factors.

GLAM2 

We included this tool for finding consensus motifs [39]
because deletions or insertions (indels) may occur
in the binding sites of the peak sequences from ChIP-Seq.

GLAM2 [5] (Gapped Local Alignment of Motifs) is a 
de novo motif finding Web tool, which was designed for 
finding motifs with indels in unaligned DNA or protein 
sequences. The tool can be installed locally or can be 
run on MEME suite [34]. GLAM2 only accepts input se- 
quences in fasta format with < 60,000 characters in the 
input file. GLAM2 contains several features that can be 
customized for the motif finding. These features include 
aligned columns, alignment replicates, iterations without 
improvement, insertion, deletion, shuffling, and examin- 
ing forward and reverse strands. GLAM2 requires email 
address for notifying the results. However, it does not 
allow either creating an account or storing the results on 
the server [5]. 

A summary of GLAM2's features can be found in 
Table 1. 

GLAM2 provides three different output formats: HTML, 
text, and MEME text format. It outputs the best motifs 
found with their start and end positions, sites, strand, mar- 
ginal score, and motif's logo. GLAM2 has a scanning 
method called GLAM2SCAN, which is used for scanning 
the alignment of the motif results against sequence data- 
bases. This method is also included in the HTML output. 
GLAM2's HTML output contains an option for verifying 
discovered motifs with the references using TOMTOM 
[36] program. Other options in the HTML output include 
viewing alignment, viewing Position Specific Probability 
Matrix (PSPM), and finding replications that are similar to 
the best motif found [5]. 

The PSPM is a $4 \times l$ matrix containing 4 rows for the
four nucleotides (A, C, G, T) and l columns, where l is the
size of the motif. Each entry in the matrix is the
frequency of a nucleotide in the multiple alignment of the
sequences. This frequency is represented by a probability
value.
GLAM2 implements a generalization of the gapless
Gibbs sampling algorithm. It examines the input
sequences and returns an alignment of segments of these
sequences. Each sequence contributes at most one
segment to the alignment. GLAM2 assumes a motif is
defined by residue preferences at certain positions called
key positions. However, in a particular motif instance,
key positions can be deleted and residues can be inserted
between key positions. GLAM2 uses a scoring scheme for
alignments in which identical or similar residues aligned
in the same key position are rewarded while deletions and
insertions are penalized. However, the penalty is not
severe if deletions and insertions consistently occur at
the same locations. Using this scoring scheme, GLAM2
calculates the marginal score, which reflects how well each
segment matches the other segments. GLAM2 seeks the
motif alignment with the maximum score under this
scheme. Because the number of possible alignments is
too large to enumerate, GLAM2 uses a heuristic
optimization method called simulated annealing. This
method takes an initial alignment and repeatedly makes
changes to it. Some changes increase the score and others
decrease it; accepting occasional decreases lets the search
escape local optima. GLAM2 performs two types of changes,
called site sampling and column sampling. The changes are
applied until the score fails to improve. To verify the
high-scoring motif found, the whole procedure is repeated
a number of times from different random starting
alignments. GLAM2's performance can also be controlled by
the parameters described above [5].
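
The idea that a change may either raise or lower the score can be illustrated with the Metropolis acceptance rule commonly used in simulated annealing. This is a generic sketch and an assumption on our part; GLAM2's own site sampling and column sampling moves are more elaborate, and the function name and temperature parameter are hypothetical.

```python
import math
import random

def accept_change(delta_score, temperature, rng=random):
    """Metropolis rule: always accept improvements, sometimes accept losses.

    delta_score is the new alignment score minus the current score.
    Lower temperatures make score-decreasing changes less likely.
    """
    if delta_score >= 0:
        return True
    return rng.random() < math.exp(delta_score / temperature)

# An improvement is always accepted; a very large loss at a very low
# temperature is effectively never accepted.
always = accept_change(1.0, temperature=1.0)
never = accept_change(-1e9, temperature=1e-3)
```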

GLAM2 is time consuming, and its running time scales
linearly with the sequence length. GLAM2 works best for
small datasets and short motifs. It is difficult for GLAM2
to analyze sequences longer than a few thousand residues,
and it is impractical for GLAM2 to analyze sequences that
are > 10,000 bp [5].

GLAM2 can only model a single motif at a time and it 
does not detect alternative binding motifs. 

CisFinder 

CisFinder [6] is a de novo motif finding Web tool for 
finding over-represented short DNA motifs. It imple- 
mented the novel CisFinder algorithm. The tool accepts 
input sequences in fasta format and plain text delimited 
format. CisFinder accepts four main file types: sequences, 
motifs, search results, and repeats. It was designed for pro- 
cessing large input dataset up to 50 Mb. CisFinder allows 
uploading the control file or using the public control file 
provided by the tool. CisFinder provides several analysis 
tools such as identifying motifs, improving motifs, 
clustering motifs, comparing motifs, showing motif, 
searching motif, and showing search results. It allows 
downloading and deleting each of four main file types. 



CisFinder provides several different parameters for cus- 
tomizing the motif finding. It does not allow specifying 
motif size but it allows selecting motif reference databases 
such as JASPAR [30], CisView [32], or user's reference. 
CisFinder allows using Guest account or setting up a user 
account for using the tool. Registered users can store the 
results on the server while Guest user has only one full 
session [6]. 

A summary of CisFinder's features can be found in 
Table 1. 

CisFinder's output can be in HTML and text formats.
The output contains elementary motifs and motif
clusters, both of which can be saved or downloaded. The
elementary motifs are listed by name, logo, pattern,
frequency, enrichment ratio, information content of motif,
score, FDR, and so on. Motif clusters are listed by name,
logo, pattern, number of motifs in cluster, frequency,
enrichment ratio, information content of motif, score, FDR,
palindrome, method of motif clustering, and so on [6].

The CisFinder algorithm is based on the estimation of
position frequency matrices (PFMs). This estimation is
calculated from n-mer word counts in the test set and
control set of sequences. CisFinder contains five main
features. First, the algorithm is based on detecting
over-represented short words in a sequence and clustering
them. Second, the algorithm examines words with gaps
and expands PFMs over the gaps and neighboring
regions. Third, it uses real control sequences to compare
against test sequences, which lets it process repeat
regions without removing repeat sequences; this matters
because TF binding sites are often located in repeat
regions. Fourth, it performs exhaustive searches for all
over-represented DNA motifs in a single run and combines
motifs only at the clustering step. Finally, it includes
several other functions such as comparing motifs with
reference databases, searching for motifs that match PFMs,
visualizing sequences and TF binding motifs with CisView
[32] or the UCSC genome browser [40], and extracting
sequence fractions and subsets of sequences [6].
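
The first feature, detecting over-represented words from counts in a test set and a control set, can be sketched as follows. The enrichment formula and the pseudocount are our own simplifying assumptions for illustration; they are not CisFinder's published statistics.

```python
from collections import Counter

def word_counts(sequences, n):
    """Count every n-mer word occurring in the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def enrichment_ratios(test, control, n, pseudocount=1.0):
    """Ratio of word frequency in the test set to that in the control set."""
    test_counts = word_counts(test, n)
    control_counts = word_counts(control, n)
    test_total = sum(test_counts.values())
    control_total = sum(control_counts.values())
    ratios = {}
    for word, count in test_counts.items():
        test_freq = (count + pseudocount) / (test_total + pseudocount)
        control_freq = (control_counts[word] + pseudocount) / (control_total + pseudocount)
        ratios[word] = test_freq / control_freq
    return ratios

# Hypothetical test and control sequences.
test = ["ACGTACGT", "TTACGTTT"]
control = ["TTTTTTTT", "GGGGTTTT"]
ratios = enrichment_ratios(test, control, n=4)
top_word = max(ratios, key=ratios.get)
```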

CisFinder provides flexibility for input file formats and 
file types. It can process large datasets and provides sev- 
eral tools for motif analysis. CisFinder algorithm can 
accurately identify PFMs of TF binding motifs [6]. CisFin- 
der runs much faster than MEME [4], Weeder [41], and 
RSAT [6,42]. It can detect alternative binding motifs as 
well as binding motifs of potential co-factors. Finally, it 
can find motifs with a low level of enrichment [6]. 

W-ChIPMotifs

W-ChIPMotifs [7] is a de novo motif finding Web tool
for ChIP-based high throughput data. It only accepts
input sequences in fasta format. W-ChIPMotifs does not
specify either the maximum input file size or the
maximum sequence length. The tool does not have options
for specifying motif size and number of motifs to return.
W-ChIPMotifs incorporates the STAMP [43] tool for
inferring phylogenetic information and verifying discovered
motifs against the reference databases. It requires
specifying human or mouse species, user's name, email, and
transcription factor before submitting the request. The
tool allows supplying a control file. However, it does
not allow either creating an account or storing the
results on the server [7].

Tran and Huang Biology Direct 2014, 9:4
http://www.biology-direct.com/content/9/1/4

A summary of W-ChIPMotifs's features can be found
in Table 1.

The output of W-ChIPMotifs consists of two files in
PDF format, delivered via email only. One file contains the
found motifs and the other contains matched similar motifs
from STAMP. The discovered motifs are listed by name, logo,
confidence level, PWMs, core and PWM scores, P-values,
and Bonferroni correction P-value. The matching motifs
from STAMP are listed by name, E-value, alignment, and
logo. A phylogenetic tree for matching motifs is also
included [7].

W-ChIPMotifs is based on the previous ChIPMotifs
program [7]. The tool is a pipeline system that
incorporates three motif finding tools, MEME [4], MaMF
[44], and Weeder [45], for motif detection [7].
W-ChIPMotifs optimizes the significance of found motifs
using a bootstrap re-sampling method and the Fisher test.
It identifies fewer than about 10 candidate motifs and
constructs n PWMs for each candidate motif. Then, it uses
a bootstrap re-sampling method to infer the optimized PWM
scores. If control data is not supplied, W-ChIPMotifs uses
the default control dataset for the species selected by the
user. It generates a negative control dataset by
randomizing each input sequence 100 times. The generated
negative control dataset no longer corresponds to the
original sequences but shares the same nucleotide
frequencies, and it is used for scanning the identified
motifs. W-ChIPMotifs uses the Fisher test and P-value to
identify the significance cutoff for the scores [7].
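
The generation of the negative control dataset can be sketched as below, assuming that "randomizing" means shuffling the letters of each input sequence, which preserves its nucleotide frequencies. The function name and the seed parameter are illustrative assumptions.

```python
import random
from collections import Counter

def shuffled_controls(sequences, replicates=100, seed=0):
    """Shuffle each input sequence `replicates` times.

    Every shuffled copy keeps the nucleotide composition of its source
    sequence but no longer preserves the original order.
    """
    rng = random.Random(seed)
    controls = []
    for seq in sequences:
        for _ in range(replicates):
            letters = list(seq)
            rng.shuffle(letters)
            controls.append("".join(letters))
    return controls

# Hypothetical input sequences.
controls = shuffled_controls(["ACGTACGT", "GGGCCCAT"])
```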

W-ChIPMotifs can only model a single motif at a time
and does not detect alternative binding motifs. However,
it combines three existing motif finding tools to
maximize the chance of obtaining true motifs.

DREME 

DREME [9] (Discriminative Regular Expression Motif
Elicitation) is a motif finding Web tool available from
the MEME suite [34]. It was designed for finding short (< 8
bases), core DNA-binding motifs of eukaryotic TFs, and it
is able to process very large ChIP-Seq datasets [9].
DREME is capable of finding binding motifs for cofactor
TFs. It only accepts input sequences in fasta format.
DREME allows setting an E-value cutoff but does not allow
specifying motif size. DREME includes an option in the
output for verifying found motifs against reference
databases using the TOMTOM [36] program. DREME requires
an email address for notifying the results. It requires
selecting a comparison source, which is set to shuffled
sequences by default. It allows specifying the type of
strand to use. DREME does not allow either creating an
account or storing the results on the server [9].

A summary of DREME's features can be found in 
Table 1. 

DREME provides three different output formats: HTML, 
XML, and text. The found motifs are listed by name, logo, 
and E-value. The motifs details include number of positive 
and negative strands matching that motif, P-value, E-value, 
and enriched matching words for that motif. DREME 
allows submitting discovered motifs to other programs 
within MEME suite [34] for further analysis. The found 
motif can be downloaded as a position weight matrix or 
a custom logo [9]. 

The DREME algorithm is based on a simplified form of
regular expression. Its motif detection is exhaustive for
exact words and heuristic for words with wildcards. To
identify the significant, discriminative motifs, the
algorithm uses Fisher's Exact test to calculate the
significance of the relative enrichment of each motif in
two sets of sequences. One set is the set of ChIP-Seq peak
regions and the other is either similar data from a
different ChIP-Seq experiment or shuffled versions of the
first sequences. The algorithm counts only the number of
sequences containing a motif in each dataset. When the
motif with the highest significance is found, all of its
non-overlapping occurrences in the first set of sequences
are aligned to create a position specific probability
matrix. To find multiple, non-redundant motifs in a set of
sequences, the algorithm erases the best motif found by
setting all its occurrences to a special letter that cannot
match any motif. Then, the algorithm repeats the search
for motifs [9].
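
The counting, testing, and erasing steps can be sketched as follows. This sketch handles exact words only (no wildcards), computes a one-sided Fisher's Exact test directly from the hypergeometric distribution, and uses "N" as the special erasing letter; these choices and all function names are our own assumptions rather than DREME's implementation.

```python
from math import comb

def count_sequences_with(word, sequences):
    """Number of sequences that contain the word at least once."""
    return sum(word in seq for seq in sequences)

def fisher_exact_enrichment(pos_hits, pos_total, neg_hits, neg_total):
    """One-sided Fisher's Exact test p-value for enrichment in the positives.

    Probability of seeing at least pos_hits matching sequences among the
    positives, given the table margins (hypergeometric upper tail).
    """
    hits = pos_hits + neg_hits
    total = pos_total + neg_total
    upper = min(hits, pos_total)
    tail = sum(
        comb(hits, k) * comb(total - hits, pos_total - k)
        for k in range(pos_hits, upper + 1)
    )
    return tail / comb(total, pos_total)

def erase(word, sequences, mask="N"):
    """Replace each non-overlapping occurrence of the word with mask letters."""
    return [seq.replace(word, mask * len(word)) for seq in sequences]

# Hypothetical peak sequences and shuffled-style control sequences.
positives = ["TTACGTT", "ACGAAAA", "GGGACGG"]
negatives = ["TTTTTTT", "GGGGGGG", "CACACAC"]
p_value = fisher_exact_enrichment(
    count_sequences_with("ACG", positives), len(positives),
    count_sequences_with("ACG", negatives), len(negatives),
)
masked = erase("ACG", positives)
```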

DREME is much faster than MEME [4], Weeder [41], 
and NestedMICA [9,46]. Its runtime scales linearly with 
the size of the dataset [9]. 

MEME-ChIP 

MEME-ChIP [10] is a Web service designed for analyzing
ChIP-Seq datasets, and it is available from the MEME
suite [34]. MEME-ChIP provides several analysis tools
such as motif discovery, motif enrichment analysis, motif
visualization, binding affinity analysis, and motif
identification. MEME-ChIP is a pipeline system that
incorporates MEME and DREME into a Web service.
MEME-ChIP only accepts input sequences in fasta format.
It does not have restrictions on the size of input
sequences or the number of uploaded sequences. Thus,
MEME-ChIP can analyze very large ChIP-Seq datasets.
It allows setting an E-value cutoff as well as selecting
motif size and number of motifs to return. MEME-ChIP allows
verifying found motifs against several motif reference
databases. It provides universal options, MEME options,
DREME options, and CentriMo [47] options for
customizing the motif detection. MEME-ChIP requires
an email address for notifying the results. However, it
does not allow either creating an account or storing
the results on the server [10].

A summary of MEME-ChIP's features can be found in
Table 1.

MEME-ChIP provides three different output formats: 
HTML, XML, and text. The output can be viewed in 
MEME output format, DREME output format, as well as 
in CentriMo [47] and TOMTOM [36] report formats [10]. 

MEME-ChIP incorporates two complementary motif
finding algorithms, MEME and DREME [10]. MEME
implements multiple EMs while DREME uses the regular
expression approach. DREME is capable of detecting
very short motifs that are not found by MEME.
MEME-ChIP uses TOMTOM [36] to verify the motifs
discovered by MEME and DREME against the reference
databases [10]. MEME-ChIP also uses the AME algorithm [48]
for detecting very low levels of enrichment of binding
sites for motif enrichment analysis [10]. MEME-ChIP uses
the MAST [37] and AMA [49] algorithms for visualizing
motifs as well as for binding strength analysis [10,48].

CompleteMOTIFs 

CompleteMOTIFs [8] is a de novo motif finding Web
tool, which was designed for finding over-represented
transcription factor binding motifs from ChIP-Seq.
CompleteMOTIFs is a pipeline system that incorporates
MEME [4], Weeder [45], and ChIPMunk [50] into a
Web tool. It accepts input sequences in fasta, BED, and
GFF formats. CompleteMOTIFs accepts input files of
< 500,000 bp for MEME and Weeder, and of
< 5,000,000 bp for ChIPMunk. CompleteMOTIFs
allows selecting a motif reference database as
well as supplying the user's own reference in Position
Specific Scoring Matrices. It also requires specifying the
type of the background sequence and the reference
genome used in the motif finding. Other options for
customizing the motif detection include setting a P-value
cutoff, specifying the types of nucleotide shuffling, and
the number of times for nucleotide shuffling.
CompleteMOTIFs allows specifying motif size for running
MEME only. It does not allow specifying the number of
motifs to return. CompleteMOTIFs allows using a Guest
account or setting up a user account for using the tool.
Registered users can store the results on the server.
CompleteMOTIFs also provides annotation analysis and eight
Boolean logic operations for file manipulation. It also
provides two utilities: convert BED to fasta, and convert
fasta to BED [8].

A summary of CompleteMOTIFs' features can be found
in Table 1.

CompleteMOTIFs' output can be in both HTML and
text formats. The output can be viewed in MEME [4],
Weeder [45], and ChIPMunk [50] formats depending on
the selections made when submitting the request. The motif
results can be verified against the JASPAR [30] and
TRANSFAC [31] databases using the Patser [21] scanning
method. The top 10 motifs with their logos can be viewed on
the browser. The tool also shows the motif clustering
result from STAMP [43]. All results can be downloaded in a
zip file [8].

CompleteMOTIFs incorporates three existing motif
finding tools into a Web tool. However, the results are
specific to each tool selected by the user. Each tool has
its own approach for finding motifs. MEME uses the
multiple EMs algorithm [4]. Weeder implements a
suffix tree based exhaustive enumeration algorithm [45].
ChIPMunk implements an iterative algorithm that
combines greedy optimization with bootstrapping [50].

RSAT peak-motifs 

Peak-motifs [11] is a pipeline system for finding motifs
in ChIP-Seq data. It can be used as a stand-alone
application and as a Web service. Peak-motifs provides
several selective categories for customizing the motif
detection as follows [11].

Uploading input. Peak-motifs accepts different types of
input sequences such as raw, multi, tab, fasta, wconsensus,
and IG formats. The input sequences can be uploaded
in a .gz compressed file. Peak-motifs can also take the
input from another Web server via URL. The input
sequences can be masked into lowercase, uppercase, or
non-dna. Peak-motifs does not have limitations on the
size of the sequence or the number of peaks in the input.
It also allows uploading control sequences [11].

Reducing peak sequences. Peak-motifs lets the user select
the number of top sequences to retain for motif finding.
It can also clip each peak sequence to a given number of
base pairs on each side of the peak center before the motif
detection [11].
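
The two reductions described above can be sketched as follows. This is only
an illustration under stated assumptions, not the peak-motifs code: peaks
are represented as hypothetical `(chrom, start, end, score)` tuples, the
"top" sequences are taken to be those with the highest score, and the names
`reduce_peaks`, `top_n`, and `flank` are invented for the example.

```python
def reduce_peaks(peaks, top_n, flank):
    """Keep the top_n highest-scoring peaks and clip each to +/- flank bp.

    peaks: list of (chrom, start, end, score) tuples (assumed layout).
    """
    ranked = sorted(peaks, key=lambda p: p[3], reverse=True)[:top_n]
    clipped = []
    for chrom, start, end, score in ranked:
        center = (start + end) // 2
        # Never extend a peak beyond its original boundaries.
        new_start = max(start, center - flank)
        new_end = min(end, center + flank)
        clipped.append((chrom, new_start, new_end, score))
    return clipped


if __name__ == "__main__":
    toy = [("chr1", 100, 500, 5.0), ("chr1", 1000, 1040, 9.0)]
    print(reduce_peaks(toy, top_n=1, flank=50))
```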

Motif discovery parameters. Peak-motifs provides options
for finding over-represented words, words with a positional
bias, words with local over-representation, and
over-represented spaced word pairs. It allows selecting
oligomer lengths of 6, 7, and 8 characters. Peak-motifs
also includes several choices for the Markov order of the
background model. The user can request between 1 and 10
motifs per algorithm and can run the motif detection on a
single strand or on both strands. Peak-motifs offers several
reference databases, including a user-supplied database and
known reference motifs, for verifying the discovered
motifs [11].

Locating and visualizing motifs. Peak-motifs allows
searching for putative binding sites in the peak sequences.
It includes several options for the Markov order of the
background model used for sequence scanning. It also allows
visualizing peaks and sites in a genome browser [11].

Output option. Peak-motifs provides two output options:
displaying the results in the browser or emailing the
results to the user. The latter requires the user's email
address [11].

A summary of peak-motifs' features can be found in
Table 1.

All motif results can be downloaded as a zip file, and all
matrices can be downloaded in TRANSFAC format. Peak-motifs
displays detailed results in several categories, such as
sequence composition and statistics, number of discovered
motifs per algorithm, number of discovered motifs with motif
comparison, individual motifs and their matrices, motif
locations (sites), and motif comparisons [11].

Peak-motifs is a computational pipeline that incorporates
several algorithms. The algorithms used for motif finding
are RSAT dyad-analysis [51], RSAT local-word analysis [52],
MEME [4], and ChIPMunk [11,50]. Peak-motifs also uses the
pattern matching algorithm matrix-scan-quick from RSAT
[11,53], and the RSAT compare-motifs algorithm for motif
comparison. The motif finding relies on a combination of
tried and tested algorithms that are integrated in the RSAT
software suite, and it uses complementary criteria for
detecting motifs [11].

PscanChIP 

PscanChIP [3] is a motif finding Web tool for ChIP-Seq
data. It accepts input only in BED format. PscanChIP
assumes that each region is centered on the point of maximum
enrichment within the peak (the summit), and it analyzes
only 150 bp around the summit of each region. It does not
provide options for selecting the motif size or the number
of motifs to return. PscanChIP requires selecting the
species (human or mouse) and the associated assembly. It
allows selecting the background model and the motif
reference database, such as JASPAR [30], TRANSFAC [31], or a
user-supplied database. PscanChIP does not support creating
an account or storing results on the server [3].

A summary of PscanChIP's features can be found in
Table 1.

PscanChIP's output is available in HTML and text formats.
The results include fields such as binding profile name,
binding profile ID, local enrichment P-value, local over- or
under-representation, global enrichment P-value, global
over- or under-representation, Spearman correlation
coefficient, preferred position, and position bias P-value.
For each matrix in the results, PscanChIP shows the matrix's
detailed information, its position weight matrix (PWM), the
motif's logo, and all occurrences [3].

PscanChIP is based on the earlier Pscan tool for promoter
analysis. It computes the global enrichment, which
identifies motifs that are over-represented in the input
regions. It also computes the local enrichment, which
identifies motifs with a significant preference for binding
within the regions. In addition, PscanChIP evaluates motif
positional bias within the input regions. It can identify
the actual binding sites of the TF as well as secondary
motifs corresponding to other TFs that tend to bind the same
regions [3].
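
One generic way to quantify over-representation of a motif is a z-test that
compares the mean best-match score of the motif in the input regions with
the mean and standard deviation of the same score in a background set. The
sketch below shows only this generic test; whether PscanChIP computes its
enrichment P-values in exactly this way is not stated in this section, and
the function names and the example numbers are assumptions.

```python
import math
from statistics import mean


def enrichment_z(region_scores, bg_mean, bg_sd):
    """z-score of the mean region score against a background distribution."""
    n = len(region_scores)
    return (mean(region_scores) - bg_mean) / (bg_sd / math.sqrt(n))


def one_sided_p(z):
    """Upper-tail P-value of a standard normal z-score."""
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Toy best-match scores per region; background mean and sd are made up.
    z = enrichment_z([0.8, 0.9, 0.7, 0.8], bg_mean=0.6, bg_sd=0.2)
    print(z, one_sided_p(z))
```

A large positive z (small P-value) suggests over-representation and a large
negative z suggests under-representation, which matches the over- or
under-representation fields listed in the output above.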

Peak calling tools 

Many factors can affect the result of motif finding, such
as the quality of the antibody used, read length, sequencing
errors, the read mapping procedure, and the peak caller. In
this section we discuss only the factor with the most direct
influence, the peak calling tool. We recommend that users
select a peak calling tool that suits the type of research
being conducted. A summary of a number of peak calling tools
is provided in Table 2. In addition, control data are
important for background model validation, so it is better
to run the peak calling tool with both the input and the
control data.

The peak finding process consists of three essential steps:
pre-processing, mapping, and peak finding [78]. The
pre-processing step removes erroneous and low-quality reads.
The mapping step aligns the reads to the reference genome.
This step is critical because a read can map to multiple
locations in the genome. Mapping can therefore favor
specificity, by using uniquely mapped reads only, or
sensitivity, by allowing multiple alignments per read.
Finally, the peak finding step identifies significant peak
signals against the background signal [78].
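
The final step can be pictured with a very small sliding-window style
sketch: count mapped read starts in fixed windows and keep the windows whose
count is well above the average. This is a toy illustration, not the
algorithm of any tool in Table 2; the window size, the fold threshold, and
the function name `call_windows` are assumptions.

```python
from collections import Counter


def call_windows(read_starts, window=200, min_fold=5.0):
    """Return (start, end, count) for windows enriched over the mean count.

    read_starts: mapped read start positions on one chromosome.
    """
    if not read_starts:
        return []
    counts = Counter(pos // window for pos in read_starts)
    n_windows = max(counts) + 1
    # Crude background estimate: average number of reads per window.
    background = len(read_starts) / n_windows
    return [
        (w * window, (w + 1) * window, c)
        for w, c in sorted(counts.items())
        if c >= min_fold * background
    ]


if __name__ == "__main__":
    toy = [10, 20, 30, 40, 50, 60, 70, 80, 450, 900, 1500, 1990]
    print(call_windows(toy))
```

Real peak callers replace the crude average with a statistical model of the
background, often estimated from control data, which is why running the peak
caller with control data matters.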

Several algorithms have been developed for identifying
true peaks. There are three types of peaks in ChIP-Seq
data: punctate regions spanning a few hundred base pairs or
less, localized but broader regions spanning up to a few
kilobases, and broad regions spanning up to several hundred
kilobases. Different peak categories are associated with
different types of binding events. For example, a punctate
region is the signature of a sequence-specific transcription
factor such as NRSF or CTCF. A combination of punctate and
broader regions is associated with proteins such as RNA
polymerase II. Broad regions can be associated with histone
marks and other chromatin domain signatures [79].
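
The three categories can be expressed as a simple width-based rule. The
numeric cutoffs below (500 bp and 5 kb) are assumptions chosen only to
stand in for "a few hundred base pairs" and "a few kilobases"; the source
does not give exact boundaries.

```python
def classify_peak(width_bp, punctate_max=500, localized_max=5000):
    """Assign a peak to one of the three width categories.

    The default cutoffs are illustrative assumptions, not published values.
    """
    if width_bp <= punctate_max:
        return "punctate"
    if width_bp <= localized_max:
        return "localized-broad"
    return "broad"


if __name__ == "__main__":
    for width in (150, 2000, 250000):
        print(width, classify_peak(width))
```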

Different peak finding tools implement different algorithms
for targeting these types of peaks. Thus, users should
select a peak finding tool that is relevant to the

Published year | Language | Operating system | Software features | Latest release version | Latest release year | Website | Maintenance | Ref. #



BayesPeak 



BroadPeak 



CisGenome 



DROMPA (DRaw 
and observe 
Multiple enrichment 
profiles and 
annotation) 

F-Seq 



FindPeaks 



GEM (Genome 
wide Event 
finding and 
Motif discovery) 

GLITR (Global 
Identifier of 
Target Regions) 

GLMNB (Negative 
binomial 
generalized 
linear model) 

Hpeak (Hidden 
Markov model 
(HMM)-based 
Peak-finding 
algorithm) 



BayesPeak algorithm 



Maximal-segment 
algorithm, Gibbs 
sampling algorithm, 
Ruzzo-Tompa 
algorithm 

Two-pass algorithm 



Sliding window 



F-Seq density 

estimation 

algorithm 

Used directional
reads module for 
identifying peaks 

Genome wide 
event finding 
and motif 
discovery (GEM) 

GLITR algorithm 



Sliding window 



HMM-based 
algorithm 



Used Hidden 
Markov model 
(HMM) for 
finding peaks 

Probabilistic 
model 



Implemented a
modular design,
use sliding
window for peak
detection

Two-step 

procedure, 

DROMPA 

peak-calling 

program 

Kernel density 



Implemented a 

modular 

architecture 

Probabilistic 
model 



Used ChIP-Seq
Peak Finder
framework

Generalized 
Linear Model 
with Negative 
binomial 
distribution 

Hidden Markov 
Model (HMM) 



2011 



2013 



R and C Linux, Windows, 
and Mac OS X 



N/A 



2013 



2008 



2008 



2012 



2009 



2012 



2010 



C,C++ 



ANSI-C 



Java 



Java 



Java 



Perl and
Python 

N/A 



Windows, Mac, 
and Linux 



Linux 



Unix, Linux



Linux, Windows, 
and Mac OS X 



N/A 



N/A 



N/A 



Perl and Linux, Windows,
C++ and Mac OS 



Support 
multicore 



N/A 



Stand-alone 
system, 
command 
mode and 
GUI 

N/A 



N/A 



Command 
line 

Stand-alone 
software 



N/A 



N/A 



N/A 



1.12.0 N/A 



One 2013 
version 



v2.0 2011



1.4.0 2013 



1.84 



4.0 



23 



N/A 



1.0 



V2.1 



2011 



N/A 



2013 



N/A 



2012 



2009 



http://compbio.sysbiol.cam.ac.uk/Resources/BayesPeak/csbayespeak.html

http://jordan.biology.gatech.edu/page/software/broadpeak/

http://www.biostat.jhsph.edu/~hji/cisgenome/

http://www.iam.u-tokyo.ac.jp/chromosomeinformatics/rnakato/drompa/

http://fureylab.web.unc.edu/software/fseq/

http://vancouvershortr.sourceforge.net/

http://cgs.csail.mit.edu/gem/

N/A

http://sourceforge.net/projects/glmnb/

http://www.sph.umich.edu/csg/qin/HPeak/



Yes 



Yes 



Yes 



Yes 



Yes 



Yes 



Yes 



N/A 



N/A 



N/A 



[54] 



[55] 



[56] 



[57] 



[58] 



[59] 



[60] 



[61] 



[62] 



[63]

Table 2 A summary of peak calling tools (Continued)



MACS (Model-
based analysis
of ChIP-Seq)

NEXT-peak (the 

normal-exponential 

two-peak) 

PeakRanger 



PeakSeq 



MACS algorithm
(use shift and
sliding window
algorithm)

NEXT-peak
algorithm



Same algorithm 
as PeakSeq for 
identifying broad 
regions. Summit- 
valley-alternator 
algorithm 

PeakSeq - two- 
pass strategy 



Model-based
Analysis of
ChIP-Seq

Normal- 
exponential 
two-peak (NEXT- 
peak) model 

Build the read 
coverage profile 



Two-pass 
strategy 



2008 



2013 



2011 



2009 



Python Linux 



C++ Linux 



C++ Linux, Mac

OS, and 
Windows 



C and Perl N/A 



QuEST (Quantitative 
Enrichment of 
Sequence Tags) 



SeqSite 



SICER 



Construct profiles
and use shifting
method



Two-step strategy:
detect tag-enriched
regions and then
pinpoint binding
sites in the
detected regions

Scoring scheme 



Statistical 
framework- 
Kernel Density 
Estimation 
approach 

Poisson model 



Spatial clustering 
approach 



2011 



2009 



C++ Linux, Mac

OS 



C/C++ Windows, 
Mac OS X, 
and Linux 



Python Linux, Unix 



SIPeS (Site 
Identification 
from Paired- 
end Sequencing) 



SIPeS algorithm 



Used dynamic 
fragment pileup 
value for peak 
calling 



2010 



Linux 



SISSRs (Site 
Identification 
from Short 
Sequence 
Reads) 

Sole-Search 



Site Identification 
from Short 
Sequence Reads 
(SISSRs) algorithm 

Sole-Search 
program 



Sliding window 2008 



Implemented 
several different 
analysis steps 
for peak calling 



Perl 



Linux, UNIX 



2010 



Java N/A 



stand-alone, 
no GUI, open 
source 

N/A 



1.4.2 



1.1 



Support parallel 1.16 

cloud 

computing 



2012 



2013 



201 : 



http://liulab.dfci.harvard.edu/MACS/

http://ww2.odu.edu/~nxkim/nextpeak/

http://ranger.sourceforge.net/



Yes 



Yes 



Yes 



[64] 



[65] 



[66]

N/A



1.1 



Open source, 2.4 
non-profit use 



2011 



2009 



http://info.gersteinlab.org/PeakSeq

http://mendel.stanford.edu/SidowLab/downloads/quest/



N/A 



No 



[67] 



Academic use 1.1.2 
only 



2010 



http://bioinfo.au.tsinghua.edu.cn/software/seqsite/



Yes 



N/A 



vl.- 



Non-profit use 2.0 



2011 



2010 



N/A 



vl4 



http://home.gwu.edu/~wpeng/Software.htm

http://gmdd.shgmo.org/Computational-Biology/ChIP-Seq/download/SIPeS

http://sissrs.rajajothi.com/



Yes 



N/A 



N/A 



[70] 



[71] 



[72] 



Web-based N/A 
software 



N/A 



N/A 



No 



[73] 



Table 2 A summary of peak calling tools (Continued) 



T-PIC (Tree
shape Peak
Identification
for ChIP-Seq)

USeq



W-ChIPeaks



ZiNBA (Zero- 
Inflated 
Negative 
Binomial 
Algorithm) 



Tree shape Peak
Identification for
ChIP-Seq (T-PIC)
algorithm

Collection of 
algorithms and 
software for 
peak calling 

PELT algorithm 
and BELT 
algorithm 

Zero-Inflated 

Negative 

Binomial 

Algorithm 

(ZINBA) 



Tree-based 
statistics 



Implemented 
several different 
methods for 
peak calling 

Statistical 
methods 
control false 
discovery rate 

Statistical 
framework 



2011 



2008 



2011 



2011 



R and Perl 



Java 



PHP, Perl,
Java and
C++

C and R



N/A 



Linux, Mac 
OS X, and 
Windows 



N/A 



Mac OS X 
and Linux/ 
Unix 



N/A 



GUI 



One 
version 



Web tool 



1.01 



2011 



2013 



201 : 



http://www.math.miami.edu/~vhower/tpic.html



N/A 



http://useq.sourceforge.net/ Yes



http://motif.bmi.ohio-state.edu/W-ChIPeaks/ Yes for BELT only



[74] 



[75] 



[76]

Support
multi-core
clusters



2.02.03 



2012 



http://code.google.com/p/zinba/



Yes 



[77]

type of research being conducted, to maximize the chance of
obtaining the best possible peak sequences for finding
motifs. Software tools such as peakROTS [80] and the tool
presented by Schweikert et al. [81] can assist users in
optimizing the peak calling and in choosing a relevant
software package for their analysis, and users may want to
consider them. Here we provide an overview of each tool.

peakROTS implements a generic data-adaptive procedure that
optimally adjusts the parameters of a given software package
to the properties of each ChIP-Seq dataset independently.
This avoids poor parameter settings for a given dataset and
provides guidance for selecting peak calling parameters. It
notifies users whether the quality of the data and/or the
parameters of the selected software package are sufficient
for reliable binding site detection. It also recommends the
package that is optimal for a given dataset [80].

Schweikert et al. presented a tool that implements a
combination and fusion analysis method. The tool provides a
general assessment of available technologies and systems to
help researchers select a suitable system for their ChIP-Seq
analysis. It also offers an alternative approach for
increasing the true positive rate and decreasing the false
positive rate. The tool takes the peak sequence outputs
generated for the same dataset by different peak calling
tools, analyzes them, and combines them into an output that
is better than any of the individual outputs. The improved
peak sequence output can then be used for further
analysis [81].
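
To give a feel for what combining the outputs of several peak callers can
mean, the sketch below keeps a peak from one caller only if peaks from
enough other callers overlap it. This simple overlap vote is a stand-in
chosen for illustration; it is not the combination and fusion analysis
method of [81], and all names in it are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def consensus_peaks(peak_sets, min_support=2):
    """Keep peaks of the first caller supported by >= min_support callers.

    peak_sets: dict mapping caller name -> list of (chrom, start, end).
    The first caller in the dict serves as the reference set.
    """
    callers = list(peak_sets)
    kept = []
    for peak in peak_sets[callers[0]]:
        # The reference caller itself counts as one supporting caller.
        support = 1 + sum(
            any(overlaps(peak, other) for other in peak_sets[name])
            for name in callers[1:]
        )
        if support >= min_support:
            kept.append((peak, support))
    return kept


if __name__ == "__main__":
    toy = {
        "callerA": [("chr1", 100, 200), ("chr1", 500, 600)],
        "callerB": [("chr1", 150, 250)],
    }
    print(consensus_peaks(toy))
```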

Results and discussion 
Datasets 

We used five datasets from the ChIP-Seq experiments of
Shen et al. [82], listed in Table 3, for our motif discovery.
These datasets came from mouse liver tissue; they were
sequenced on the Illumina Genome Analyzer II and aligned to
the mouse reference genome mm9. The output alignments are in
BAM format [82].



We ran MACS [64] on each dataset with a P-value cutoff of
0.00001 for peak detection to obtain the output peak file in
BED format. However, these peak sequence datasets are large,
and the motif finding Web tools accept datasets of different
limited sizes. We therefore reduced the size of these
datasets so that they could be accepted by the motif finding
Web tools. In addition, each motif finding Web tool accepts
different formats for the peak sequence dataset, so we
prepared each peak sequence dataset in the format
appropriate for each tool. For the Web tools that accept
only FASTA format, we used the BED to FASTA conversion
utility from CompleteMOTIFs [8] to convert the MACS peak
outputs to FASTA. The details of each dataset are in Table 3.
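
The per-dataset properties reported in Table 3 (number of sequences,
shortest and longest sequence, total length) can be derived from a FASTA
file with a few lines of code. The sketch below is one possible way to do
so; it is not the script used for this study, and the function name
`fasta_stats` is an assumption.

```python
def fasta_stats(fasta_text):
    """Summarize a FASTA text: sequence count, shortest, longest, total."""
    lengths = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A header closes the previous record and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    if not lengths:
        return {"n": 0, "shortest": 0, "longest": 0, "total": 0}
    return {
        "n": len(lengths),
        "shortest": min(lengths),
        "longest": max(lengths),
        "total": sum(lengths),
    }


if __name__ == "__main__":
    print(fasta_stats(">a\nACGT\nAC\n>b\nGGG\n"))
```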

Results 

We used the two small datasets, DM230 and DM05, in FASTA
format for running MEME [4], GLAM2 [5], W-ChIPMotifs [7],
and CompleteMOTIFs [8], as these tools are unable to process
large datasets. The parameters used for running MEME on both
datasets are in Additional file 1: Table S1. We used the
default parameters provided by GLAM2 for both datasets;
these parameters can be found in Additional file 1:
Table S2. For W-ChIPMotifs, we selected the mouse species
and left the transcription factor blank for both datasets.
For CompleteMOTIFs, we used the parameters in Additional
file 1: Table S3 for both datasets. As of this writing,
CompleteMOTIFs has not completed the motif finding jobs for
either dataset, although the jobs were submitted over two
months ago.

We used all five datasets in FASTA format for running
CisFinder [6], DREME [9], MEME-ChIP [10], and
peak-motifs [11]. The parameters used for these Web tools
are in Additional file 1: Tables S4, S5, S6 and S7.
CisFinder produced a large number of motifs for each dataset
because it can detect locally over-represented motifs,
alternative binding motifs, binding motifs of potential
co-factors, and motifs with a low level of enrichment.



Table 3 Properties of the datasets

Dataset | Mark | GEO accession | Number of sequences | Shortest sequence (residues) | Longest sequence (residues) | Total length (residues) | Size (FASTA format) | Size (BED format) | Reference
--------------------------------------------------------------------------------
DM230 | PolII (RNA polymerase II) | GSM722763 | 105 | 157 | 1728 | 47242 | 49 KB | 5 KB | [82]
DM05 | p300 (co-activator protein) | GSM722762 | 142 | 130 | 1214 | 50318 | 53 KB | 7 KB | [82]
DM254 | CTCF (insulator binding protein) | GSM722759 | 4009 | 94 | 2374 | 1518265 | 1604 KB | 181 KB | [82]
DM01 | H3K4me1 (histone H3 lysine 4 monomethylation) | GSM722760 | 2001 | 175 | 8520 | 1856431 | 1871 KB | 88 KB | [82]
DM721 | H3K27ac (H3 lysine 27 acetylation) | GSM851275 | 4005 | 255 | 16542 | 5429909 | 5423 KB | 180 KB | [82]

We also used all five datasets in BED format for running
PscanChIP [3]. The parameters used are in Additional file 1:
Table S8. PscanChIP reported all globally and locally
over-represented motifs with their global and local
P-values. We used P-value < 0.05 as the threshold for
filtering both the globally and the locally over-represented
motifs in the results. The numbers of globally and locally
over-represented motifs remaining after the filter for each
dataset are in Additional file 1: Table S9, together with a
summary of all results reported by each Web tool.
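
The filtering step amounts to applying the same P-value threshold to the
global and the local P-value of each reported motif. A minimal sketch is
given below; the record layout (`name`, `global_p`, `local_p`) and the
example values are invented for illustration and are not taken from our
results.

```python
def filter_motifs(rows, alpha=0.05):
    """Return names of motifs passing the global and the local threshold.

    rows: list of dicts with hypothetical keys name, global_p, local_p.
    """
    global_hits = [r["name"] for r in rows if r["global_p"] < alpha]
    local_hits = [r["name"] for r in rows if r["local_p"] < alpha]
    return global_hits, local_hits


if __name__ == "__main__":
    toy_rows = [  # made-up values, for illustration only
        {"name": "motif_A", "global_p": 1e-8, "local_p": 0.20},
        {"name": "motif_B", "global_p": 0.30, "local_p": 0.01},
        {"name": "motif_C", "global_p": 0.04, "local_p": 0.04},
    ]
    print(filter_motifs(toy_rows))
```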

Discussion 

It is difficult to compare motif results from different
motif finding tools, even for the same peak sequence
dataset, for the following reasons. Different motif finding
Web tools implement different algorithms, which determine
the results of the motif finding. In addition, each motif
finding Web tool has its own parameter setup for finding
motifs, and both the default parameters and the parameters
selected by the user influence the motif results. For this
reason, Tompa et al. suggested using a combination of
different motif finding tools to maximize the chance of
obtaining significant motifs [83]. Moreover, motifs reported
by multiple tools are more reliable, and when multiple motif
finding tools that implement different algorithms report
identical motifs for the same dataset, this supports the
consistency and reliability of these tools. In practice,
however, motif finding tools rarely agree on a set of
exactly matching motifs. We therefore looked for
similarities between the motifs reported by different motif
finding tools. We used STAMP [43] for this purpose,
comparing the similarities between two sets of motif results
from two different motif finding Web tools. We performed
this pair-wise comparison for all motif finding Web tools
for each dataset. Since STAMP has its own required input
formats and the formats of the motif results from different
motif finding Web tools vary, we converted the motif results
to the formats required by STAMP. In addition, different
motif finding Web tools provide different settings for the
maximum number of motifs to return, so we obtained varying
numbers of motifs from these tools for each dataset. Among
them, CisFinder reported the largest number of motifs.
However, STAMP is not able to process large motif datasets,
so we reduced CisFinder's motif set to fewer than 100 motifs
for each dataset for STAMP to process.

We validated all motifs used for the similarity comparisons
against two reference databases for mouse, JASPAR [30] and
UniProbe [33], using the TOMTOM [36] program with a P-value
cutoff < 0.01. All motifs discovered in each dataset by
MEME, GLAM2, W-ChIPMotifs, MEME-ChIP, and PscanChIP were
found in either JASPAR or UniProbe. All motifs discovered by
CisFinder in the four datasets DM230, DM05, DM254, and DM721
were found in either JASPAR or UniProbe, whereas one motif
in the dataset DM01 was found in neither database.
Similarly, all motifs discovered by DREME in the three
datasets DM230, DM254, and DM721 were found in either JASPAR
or UniProbe, whereas two motifs in the dataset DM01 were
found in neither database. RSAT peak-motifs reported two
motifs that were found in neither reference, one from the
dataset DM254 and the other from the dataset DM01; all other
motifs discovered by RSAT peak-motifs were found in either
JASPAR or UniProbe. In general, most of the discovered
motifs used for the similarity comparisons were found in the
references for mouse. All validation results can be found in
column 4 of Additional file 1: Table S11.

We performed the similarity comparisons as follows. For
each Web tool, we compared its motif result with the motif
result of every other Web tool for the same dataset, using
the matrix type in Additional file 1: Table S10. We
performed this pair-wise comparison for all datasets for
each Web tool. The pair-wise comparisons of motif results
between these tools for the same dataset reveal the number
of best matches by similarity between them. However, the
similar matches may not be in one-to-one correspondence. The
comparison results are in Additional file 1: Table S11. Most
of the motifs discovered by MEME were also found by
CisFinder and W-ChIPMotifs. In addition, most of the motifs
discovered by GLAM2 were also found by all other tools
except MEME-ChIP.
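
The logic of such a pair-wise comparison can be sketched as follows: score
the similarity of two motif matrices column by column, and count how many
motifs of one tool have a sufficiently similar motif in the other tool's
result. The sketch uses the Pearson correlation of aligned columns over
ungapped offsets and an arbitrary cutoff of 0.8; these choices, and all
function names, are assumptions for illustration and do not reproduce the
settings or the alignment procedure of STAMP.

```python
import math
from itertools import combinations


def column_pcc(c1, c2):
    """Pearson correlation between two 4-element (A, C, G, T) columns."""
    m1, m2 = sum(c1) / 4, sum(c2) / 4
    num = sum((a - m1) * (b - m2) for a, b in zip(c1, c2))
    den = math.sqrt(sum((a - m1) ** 2 for a in c1) * sum((b - m2) ** 2 for b in c2))
    return num / den if den else 0.0


def motif_similarity(m1, m2):
    """Best mean column correlation over ungapped placements of the
    shorter motif inside the longer one."""
    short, long_ = (m1, m2) if len(m1) <= len(m2) else (m2, m1)
    best = -1.0
    for off in range(len(long_) - len(short) + 1):
        score = sum(column_pcc(col, long_[off + i]) for i, col in enumerate(short))
        best = max(best, score / len(short))
    return best


def count_best_matches(set_a, set_b, cutoff=0.8):
    """Number of motifs in set_a with at least one similar motif in set_b."""
    return sum(
        1 for a in set_a if any(motif_similarity(a, b) >= cutoff for b in set_b)
    )


if __name__ == "__main__":
    tools = ["MEME", "GLAM2", "CisFinder"]
    print(list(combinations(tools, 2)))  # every pair of tools is compared once
```

Because a motif in one set may match several motifs in the other, counts
obtained this way need not be symmetric, which is one reason the matches are
not in one-to-one correspondence.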

For the two small datasets DM230 and DM05, nearly all
motifs discovered by CisFinder were also found by all other
tools except MEME-ChIP. For the other three datasets, most
of the motifs discovered by CisFinder were also found by
DREME and MEME-ChIP. However, peak-motifs and PscanChIP do
not share a large number of similar motifs with CisFinder.

The output of W-ChIPMotifs includes the nucleotide
frequencies, but they are not in the form of matrices. We
therefore converted these frequencies into raw PSSMs [84],
which were used for comparison with the motif results from
other Web tools. A raw PSSM is defined in [84] as follows.
It is an l x 4 matrix with 4 columns for the four
nucleotides (A, C, G, T) and l rows for the length of the
motif. Each entry in the matrix is the frequency of a
nucleotide in the multiple sequence alignment. The matrix is
preceded by a line that starts with the character ">"
followed by some characters, which can be the name of the
matrix. The results show that most of the motifs discovered
by W-ChIPMotifs were also found by MEME and CisFinder.
However, other tools do not share a significant number of
similar motifs with W-ChIPMotifs.
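
The raw PSSM layout described above can be illustrated with a short sketch
that builds an l x 4 frequency matrix from aligned sites and prints it
after a ">" header line. The aligned sites, the matrix name, the tab
separator, and the two-decimal formatting are assumptions for this example;
the exact layout expected by a given tool should be checked against its
documentation.

```python
def raw_pssm(sites):
    """Build an l x 4 matrix of A, C, G, T frequencies from aligned sites."""
    width = len(sites[0])
    rows = []
    for i in range(width):
        column = [site[i].upper() for site in sites]
        # One row per motif position, one column per nucleotide.
        rows.append([column.count(base) / len(sites) for base in "ACGT"])
    return rows


def format_pssm(name, rows):
    """Render the matrix with a leading '>' line carrying the matrix name."""
    lines = [f">{name}"]
    lines += ["\t".join(f"{value:.2f}" for value in row) for row in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    toy_sites = ["ACG", "ACT", "AAG", "ACG"]  # toy alignment
    print(format_pssm("example_motif", raw_pssm(toy_sites)))
```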

DREME returned only one motif for the dataset DM230. This
motif was found by all other tools except MEME-ChIP. DREME
did not return any motif for the dataset DM05, although
other tools reported a number of motifs for this dataset.
For the other three datasets, most of the motifs discovered
by DREME were also found by all other tools. However,
PscanChIP does not share a large number of similar motifs
with DREME.

MEME-ChIP integrates MEME and DREME into a pipeline, which
maximizes the chance of obtaining motifs that a single tool
may miss, because these tools are complementary to each
other. We used the same parameter settings for this tool as
for running the individual tools. However, MEME-ChIP did not
report any motif for the dataset DM230, although MEME
returned 20 motifs and DREME returned one motif. For the
dataset DM05, MEME-ChIP returned 4 motifs, which were also
found by MEME, but the other tools show little similarity
for this dataset. For the other three datasets, most of the
motifs discovered by MEME-ChIP were also found by all other
tools. However, PscanChIP does not show consistent motif
similarities with MEME-ChIP.

Most of the motifs discovered by peak-motifs for all
datasets were also found by CisFinder. However, the other
tools do not share many similar motifs with peak-motifs.

PscanChIP does not allow exporting the motif results in
matrix format for further analysis. To acquire the motif
results in PSSM format, we manually followed each motif's
link to the JASPAR database site and obtained the
corresponding matrix in JASPAR format. The comparison
results show that most of the motifs discovered by PscanChIP
for all datasets were also found by CisFinder. However, the
other tools do not share many similar motifs with PscanChIP.

In general, CisFinder shows results consistent with those
of the other tools, as it produced a large number of motifs
for each dataset. The ability to detect a large number of
motifs makes CisFinder consistent, since some Web tools
missed motifs that were found by others. We suggest that
users run multiple Web tools that implement different
algorithms in order to obtain significant motifs,
overlapping similar motifs, and non-overlapping motifs.

To date, peak-motifs is the only Web tool that can take
its input from another Web server via a URL. This feature
eliminates the uploading delay and speeds up the motif
finding.

Conclusions 

In this work, we surveyed nine motif finding Web tools
that are capable of finding binding site motifs. For each
Web tool, we examined its features, approach, strengths,
and weaknesses. We pointed out that the results of motif
finding depend on several factors and discussed the factor
with the most direct influence, the peak calling tool. We
showed that different peak calling tools implement different
algorithms for targeting different types of peaks. It is
therefore critical for users to pick a peak calling tool
suited to the type of research being conducted, to maximize
the chance of obtaining the best possible peak sequences for
finding motifs. We also presented tools that can assist
users in optimizing the peak calling result and in choosing
a relevant software package for their analysis. We then
compared the nine motif finding Web tools using five
different datasets from ChIP-Seq experiments. We showed that
comparing motif results from different motif finding Web
tools is difficult because each tool has its own parameter
settings and implements a different algorithm for finding
motifs. In addition, both the default parameter settings and
the parameters selected by the user influence the motif
results. We therefore compared the motif results from
different motif finding Web tools based on their
similarities, using the STAMP [43] tool. We performed
pair-wise comparisons between two sets of motifs from two
different Web tools for all datasets. The comparison results
showed that CisFinder reported results consistent with those
of the other Web tools, as it was able to detect a large
number of motifs that were not reported by other Web tools.
Since each motif finding Web tool has its own advantages in
detecting motifs that other Web tools may not discover, we
suggest that users run multiple Web tools that implement
different algorithms to obtain significant motifs,
overlapping similar motifs, and non-overlapping motifs.

We observed that newer motif finding Web tools can find
globally over-represented motifs, locally over-represented
motifs, and alternative motifs. These newer tools can
process large datasets with long sequences. We also observed
that recent motif finding development tends to exploit the
Web to make the tools easier to use.

Future work 

From the observations above, we see that future motif
finding development for ChIP-Seq should take the form of Web
tools with user-friendly interfaces. Such a tool should be
developed as a pipeline system that integrates a number of
specialized motif finding tools for ChIP-Seq. This would
allow users to run a combination of specialized tools to
maximize the chance of obtaining significant motifs,
overlapping similar motifs, and non-overlapping motifs. The
future tool should be able to detect globally
over-represented motifs, locally over-represented motifs,
and alternative motifs. It should be able to process large
datasets with long sequences generated by NGS technology. It
should be able to take its input from another Web server via
a URL, which circumvents the uploading delay and speeds up
the motif detection. It would also be a plus if the tool
could ask the user about the type of research being
performed and provide advisory features prior to running.
Finally, the future tool should provide a number of
convenient result formats for further analysis.

Reviewers' comments 

First round 
Reviewer's report 1 

Prof. Sandor Pongor, International Centre for Genetic
Engineering and Biotechnology (ICGEB), Italy

The written English of the manuscript could be further 
improved. The subtitle "Graph" could be replaced by 
"Graph representations", or "Graph Theory". 

Author's response: 

We have revised the manuscript and further improved 
the language editing. We replaced "graph" with "graph 
representations" in the first paragraph of the Section 
General approaches for motif finding and revised the 
subtitle "Graph" to "Graph representations" in this same 
section. Below are the changes in the manuscript. 

First paragraph of the Section General approaches for
motif finding:

"...There are several different approaches for finding
motifs such as profiles, consensuses, projection, graph
representations, clustering and tree-based [2,13,14]."

In Section General approaches for motif finding, the
subtitle Graph was changed to Graph representations.

There are too many Tables. I suggest Tables 3 and 4 to 
be combined into one, and Tables 5 to 12 to go into 
Supplementary. 

Author's response: 

We combined Tables 3 and 4 into one table named
Table 3 and adjusted the text in the body of the
manuscript referring to this table. We also moved
Tables 5-12 to the Supplementary Tables file and renamed
these tables to Supplementary Tables 1-11 respectively.

Quality of written English: Needs some language
corrections before being published.

Reviewer's report 2 

Dr. Yuriy Gusev, Georgetown University Medical Center, 
USA 

The manuscript presents a critical review of many of
the existing motif finding tools and pipelines that are
designed for ChIP-seq data analysis. Applications of NGS
technologies, including ChIP-seq, have been growing steadily
over the past 5-8 years, with an ever-growing amount of data
generated across the globe. With the cost of next generation
sequencing falling fast toward a $1000 mark per genome, many
sequencing applications are becoming more accessible to
researchers. There is a clearly identifiable need for
effective, scalable and reproducible computational tools
allowing for fast and cost-effective processing and analysis
of this vast amount of raw sequencing data.

The authors provided a detailed computational review
and comparison of 9 published software packages for
ChIP-seq data processing and analysis.

The results of their comparative analysis are clearly
presented and provide, perhaps for the first time, a survey
of the capabilities, advantages and limitations of the most
current tools. One of the noticeable results of their study
is that there is a dramatic difference in the output of
these tools even though the same data sets were analysed.
The overlap reported in the paper ranges from 0 to 100%,
which is a clear indication of a problem with the existing
computational algorithms. The authors have proposed a
computational criterion for selecting the best tool based on
the largest number of binding motifs found for any
particular data set. However, in this reviewer's opinion, it
is clear that the results obtained with any particular tool
might have a high level of false positives, and purely
computational approaches do not provide a clear path to
avoiding them. Biological validation might offer one
solution to this predicament; however, if the number of
predicted binding sites is high, experimental validation
might not be feasible.

Overall, this paper presents a timely and useful survey
of ChIP-seq computational pipelines, and while it might
be of most interest to a relatively narrow community of
bioinformaticians involved with NGS-seq data analysis, it
could also serve as a guide for the growing number of
biomedical researchers involved in translational and
clinical applications of NGS technologies.

Author's response: 

We found that different Web tools reported different
numbers of motifs even for the same dataset. Motifs
reported by several Web tools that implement different
algorithms are more reliable, and we suggest that users
run multiple Web tools that implement different algorithms
to obtain overlapping motifs. Biological validation is one
of the best ways to validate motifs, but it may be
impractical for a high volume of motifs. We hope the number
of overlapping motifs from multiple Web tools is feasible
for this validation.

Quality of written English: Acceptable. 

Reviewer's report 3 

Dr. Shyam Prabhakar (nominated by Prof. Limsoon Wong)

The manuscript "A survey of motif finding Web tools
for detecting binding site motifs in ChIP-Seq data" by
Tran and Huang reviews nine web tools for motif
discovery. The authors describe the features of the tools
and apply them to five mouse ChIP-seq datasets. They
then quantify overlaps between the resulting motif lists.
Finally, they suggest that multiple tools be applied to
any individual data set, since each method has its own
pluses and minuses.

Since there are many online motif discovery tools, it is
certainly useful to have guidance on which tool one
should use on any particular ChIP-seq dataset. The tool
that's best for histone ChIP-seq may not be the same as
the one that's best for TF ChIP-seq. Some tools may
work well only when the dataset is relatively "clean," and
others may work under almost all conditions. Some may
require tightly-defined binding regions, whereas others
may tolerate broader regions extending beyond a thousand
basepairs. Unfortunately, these issues are not addressed
in any way. In fact, the manuscript provides no
guidance at all on the quality of the predictions made by
the various tools. At the end, one is still left wondering
which tool(s) one should use. The only concrete
recommendation is that it is better to use multiple tools,
but in bioinformatics this is a platitude.

Author's response: 

We have presented the detailed features of each Web
tool in the manuscript. For example, MEME suggests
removing duplicate and low information sequences from the
input dataset. MEME does not detect motifs for cofactors.
However, other Web tools such as CisFinder, DREME, and
PScanChIP are capable of detecting binding motifs for
cofactor TFs. We have provided as much detail as possible
on the properties of the input dataset that each Web tool
can accept. For instance, PScanChIP only processes
100-150 bp around the center of the summit of the peak,
MEME can take sequences of length < 1000 bp, and CisFinder
can accept sequences of length < 50 Mb produced by the
peak caller. We hope these properties assist users in
deciding which tool is capable of processing short or
broader regions. We have also provided details of the
output that each Web tool can provide, for instance, the
size of the motif (short or longer) that each Web tool can
detect and return.

All Web tools allow verifying discovered motifs against
reference databases. We have validated the discovered motifs
reported by each tool in our similarity comparisons and
found that most of them exist in the reference databases
(see our response to your Suggestions section). However,
some tools reported more motifs than others for the same
dataset. Thus, we compared these tools on the motifs they
reported for the same dataset. The details of the
comparison, the results, and the discussion of the results
reported by each Web tool are presented in the manuscript.
Based on the comparison results, we do not think a single
Web tool can be recommended for a particular ChIP-Seq
dataset, because it cannot be said with certainty which
tool is best for that dataset. Thus, we can only suggest
that users run multiple Web tools that implement different
algorithms, because they can then see exactly what each Web
tool returns for their dataset and take appropriate action.
This is also the approach implemented by the pipeline motif
detection tools, which we discussed in the manuscript. We
also suggested that users obtain the overlapping motifs,
which are more reliable because they are reported by
different tools that implement different algorithms.

The section that lists the general features of each of
the nine web tools seems too long. In the format sent to
reviewers, it covers 12-13 pages. The features mentioned
in this section are often not particularly noteworthy
("CisFinder's output can be in HTML and text format"),
and also frequently redundant because they are
listed again in Table 2. This section could be shortened
considerably.

Author's response: 

We have provided as much detail as possible to the
readers so they can see the detailed features that each
Web tool can provide. Table 2 contains a short summary
of the features for each Web tool. We think this table can
be used for a quick lookup.

The section on peak-calling tools is not well
motivated. In general, everything upstream has an influence
on motif discovery: the peak-caller, the binding landscape
of the TF, the quality of the antibody used, noise in the
ChIP-seq data set, read length, the read mapping protocol,
the thoroughness with which repeats are masked, and so on.
The list could be quite long, and it is not clear that peak
calling is the most important factor, now that peak callers
have become reasonably robust. I would not be surprised if
the use of an inappropriate peak caller or an inappropriate
P-value threshold resulted in failure to discover relevant
motifs. However, as far as I can tell, this manuscript does
not provide any data or cite any papers that quantify the
effect of peak calling on motif discovery.

Author's response: 

We mentioned some of the influencing factors pointed
out above in the revised manuscript, and we focused only
on the closest one, which is the peak calling tool. We
only presented the general idea that the output of the
peak calling tool influences the result of motif finding,
and suggested that users consider software tools that can
help them optimize the peak calling results for the
properties of their data. The suggested tools are discussed
in the manuscript.

The five ChIP-seq data sets used to evaluate the motif
discovery tools are problematic - it is not obvious in
most cases what the correct motifs are (only one of the
five is a DNA-binding TF). One could guess that the
motifs of liver-specific TFs should be enriched in, say,
H3K27ac peaks, but no attempt is made to check if this
is the case, or to evaluate the algorithms in this way.

Author's response:

As presented above, each Web tool allows verifying 
found motifs with one or more reference databases such 
as TRANSFAC or JASPAR using P-value or E-value 
threshold. We also validated the discovered motifs re- 
ported by each tool in our similarity comparisons and 
found most of them exist in the reference databases (See 
our response to your suggestions section). We rely on the 
correctness of TOMTOM and other methods that each 
Web tool used for verifying the motifs. 

The Results section relies mainly on Table 15. This 
matrix-like table lists the proportion of motifs discov- 
ered by tool X that are also discovered by tool Y. The 
data in the table can be used to cluster the algorithms 
into groups by similarity. However, the similarity rela- 
tionships could in many cases have been predicted in ad- 
vance, because many of the web tools employ the same 
algorithms (MEME, Weeder) at the back end. Due to 
this sharing of back-end algorithms, multiple web tools 
could potentially identify the same incorrect motif.

Author's response: 

We have revised our suggestion so that users are advised
to run multiple Web tools that implement different
algorithms, because motifs reported for the same dataset by
different Web tools with different back-end algorithms are
more reliable than motifs reported by a single Web tool.
Although some Web tools implement the same algorithms at
the back end, they do not always report the same motifs.
For example, MEME-ChIP integrates MEME and DREME into a
pipeline. However, MEME-ChIP did not report any motif for
the dataset DM230, although MEME reported 20 motifs and
DREME reported one motif for the same dataset.
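
The pairwise overlap discussed in this exchange (the proportion of motifs found by tool X that are also found by tool Y) can be sketched as follows. This is only an illustration: the tool names, the motif lists, and the exact-string matching rule are hypothetical placeholders, whereas the study itself compared motifs by similarity using STAMP. The sketch also shows one way to report a tool that returned no motifs as 0 (0%) rather than leaving the cell undefined.

```python
def overlap_cell(motifs_x, motifs_y, is_match=lambda a, b: a == b):
    """Count motifs of tool X that have at least one match among tool Y's motifs.

    is_match is a placeholder similarity test (exact string equality here);
    a real comparison would use a motif-similarity tool instead.
    """
    shared = sum(1 for a in motifs_x if any(is_match(a, b) for b in motifs_y))
    # A tool with zero motifs yields 0 (0%) instead of a division by zero.
    percent = 100.0 * shared / len(motifs_x) if motifs_x else 0.0
    return shared, percent


# Hypothetical motif lists for three hypothetical tools on one dataset.
results = {
    "toolA": ["CTTCTGACCTC", "ACGTGA"],
    "toolB": ["CTTCTGACCTC"],
    "toolC": [],
}

# Matrix-like table: one cell per ordered pair of distinct tools.
table = {
    (x, y): overlap_cell(results[x], results[y])
    for x in results
    for y in results
    if x != y
}
```

The table is deliberately asymmetric: the cell for (toolA, toolB) is relative to the number of motifs from toolA, while the cell for (toolB, toolA) is relative to the number from toolB.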

Suggestions: 

It would have been more useful to start with ChIP-seq
datasets for TFs that have known motifs (derived from
protein-binding microarray data, for example), and then
evaluate the web tools on their ability to recover the
known motifs. Another possibility would be to evaluate
the tools on the number of motifs they discover that
match motifs contained in TRANSFAC or JASPAR. This
latter approach is suitable if one is testing for co-motifs
(motifs bound by TFs that co-bind with the ChIP-ed
TF). However, it is vulnerable to artifacts - false GC-rich
or AT-rich motifs frequently match TRANSFAC entries
with the same nucleotide composition. Yet another
suggestion would be to use cross-validation as a measure
of motif quality/accuracy.

Author's response: 

We validated all motifs used for our similarity
comparisons against the reference databases JASPAR and
UniProbe for mouse using the TOMTOM program. Most of these
motifs were found in either JASPAR or UniProbe with
P-value < 0.01. We rely on the correctness of TOMTOM and
the other methods that each Web tool uses for verifying
found motifs against the reference databases. For example,
MEME, GLAM2, DREME, and MEME-ChIP use TOMTOM,
W-ChIPMotifs uses STAMP, and so on. We have added a
validation paragraph to the subsection Discussion in the
manuscript. Below is the additional paragraph (paragraph 2
of the subsection Discussion).

"We validated all motifs used for similarity comparisons
against two reference databases, JASPAR [30] and UniProbe
[33], for mouse using the TOMTOM [36] program with a P-value
cutoff < 0.01. All motifs discovered in each dataset by
MEME, GLAM2, W-ChIPMotifs, MEME-ChIP, and PScanChIP were
found in either JASPAR or UniProbe. All motifs discovered by
CisFinder for the four datasets DM230, DM05, DM254, and
DM721 were found in either JASPAR or UniProbe, whereas one
motif in the dataset DM01 was not found in either database.
In addition, all motifs discovered by DREME for the three
datasets DM230, DM254, and DM721 were found in either
JASPAR or UniProbe, whereas 2 motifs in the dataset DM01
were not found in either database. RSAT peak-motifs
reported two motifs that were not found in either
reference, one from the dataset DM254 and the other from
the dataset DM01. All other motifs discovered by RSAT
peak-motifs in the other datasets were found in either
JASPAR or UniProbe. In general, most of the discovered
motifs reported by each tool in each dataset used for
similarity comparisons were found in the references for
mouse. All validation results can be found in column 4 of
Supplementary Table 11."

Minor issues not for publication: 

1) Introduction: "Assume that a motif appears in each
sequence, we have $(n - l + 1)^t$ possible candidates for
motifs." To be more precise, perhaps this should be written
as, "Assuming that exactly one motif appears in each
sequence, ..." Also, the authors should clarify that this
statement applies only to the algorithms tested in this
survey. It does not apply to thermodynamically based
algorithms such as QPMEME, MatrixREDUCE and TherMoS, which
use nonlinear optimization on a continuous space of
affinity models.

Author's response: 

We have revised the definition of motif finding problem 
with more details in the manuscript. This is a simple def- 
inition, which may not apply to every algorithm dis- 
cussed in this manuscript. The change made in the first 
paragraph of the section General approaches for motif 
finding in the manuscript is below.

"...Motifs are short sequences of a similar pattern found
in sequences of DNA or protein. Consider t input nucleotide
sequences of length n and an array s = (s1, s2, s3, ..., st)
of starting positions, with one position from each
sequence. An alignment matrix is a t x l matrix, which
contains the t l-mers beginning at these starting
positions, one from each sequence, where l is the size of
an l-mer. A profile matrix is a 4 x l matrix containing 4
rows for the four nucleotides (A, C, G, T) and l columns.
Each entry in the profile matrix is the frequency of a
nucleotide in the corresponding column of the alignment
matrix. The consensus score is the sum of the highest
frequencies from each column in the profile matrix. The
motif finding problem can be stated simply as follows.
Given t input nucleotide sequences of length n, we want to
find a set of l-mers, one from each sequence, such that
they maximize the consensus score. Thus, we need to
consider all $(n - l + 1)^t$ possible starting positions or
candidates for motifs. That is, the number of candidates
for motifs is exponential in the number of input
sequences..."
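
The definition above can be made concrete with a small sketch. The function names, the toy sequences, and the exhaustive search are illustrative assumptions and are not taken from any of the surveyed tools; the score here uses raw nucleotide counts per column rather than normalized frequencies, which ranks candidates identically.

```python
from itertools import product


def profile_matrix(lmers):
    """Count each nucleotide per column of the t x l alignment matrix."""
    length = len(lmers[0])
    profile = {base: [0] * length for base in "ACGT"}
    for lmer in lmers:
        for col, base in enumerate(lmer):
            profile[base][col] += 1
    return profile


def consensus_score(lmers):
    """Sum, over columns, of the highest nucleotide count in that column."""
    profile = profile_matrix(lmers)
    length = len(lmers[0])
    return sum(max(profile[base][col] for base in "ACGT") for col in range(length))


def brute_force_motif(sequences, l):
    """Try every combination of starting positions: (n - l + 1)^t candidates."""
    n = len(sequences[0])
    best_score, best_starts = -1, None
    for starts in product(range(n - l + 1), repeat=len(sequences)):
        lmers = [seq[s:s + l] for seq, s in zip(sequences, starts)]
        score = consensus_score(lmers)
        if score > best_score:
            best_score, best_starts = score, starts
    return best_score, best_starts


# Hypothetical toy input: t = 3 sequences of length n = 8, motif size l = 4.
toy = ["ACGTACGT", "TTACGTAA", "GGGACGTC"]
score, starts = brute_force_motif(toy, 4)
num_candidates = (len(toy[0]) - 4 + 1) ** len(toy)
```

Even for this toy input the search visits 125 candidates, and the count grows exponentially with the number of sequences, which is why practical tools rely on heuristics rather than exhaustive enumeration.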

2) Motif finding Web tools section: "The EM algo- 
rithm has the following drawbacks ... It assumes there is 
exactly one appearance of the shared motif appearing in 
each sequence of the dataset but this is not always the 
case." The wording is a bit confusing here, because this 
is not really a drawback of EM per se. Rather, it is a 
drawback of one specific application of EM. 

Author's response: 

We have clarified it in paragraph 3 of the subsection 
MEME in the manuscript as follows. 

"... The EM algorithm for motif finding presented by 
Lawrence et al. has the following drawbacks. It is not 
clear how to select the starting point and when to stop 
trying different starting points. It assumes there is exactly 
one appearance of the shared motif appearing in each se- 
quence of the dataset but this is not always the case..." 

3) Same section: "MEME can only model a single 
motif at a time and it is unable to find alternative bind- 
ing motifs or motifs for co-factors." MEME should be 
able to find motifs for co-factors, because it masks previ- 
ously discovered motifs when it looks for new motifs. 

Author's response: 

We have revised it in the last paragraph of the
subsection MEME in the manuscript as follows.

"Because MEME erases previously discovered motifs
when it searches for new motifs, MEME can only model
a single motif at a time and it does not detect alternative
binding motifs, which are motifs for co-factors."

4) Page 9: the acronym PSPM should be defined. More 
generally, many terms are used to describe binding affinity 
models (PSPM, PSSM, PWM, letter-probability matrix, 
Transfac matrix). As far as I could tell, some of these terms 
mean the same thing, at least as used in this manuscript. 

Author's response: 

We have explained the acronym PSPM in paragraphs 3
and 4 of the subsection GLAM2 in the manuscript as follows.

"...Other options in the HTML output include viewing
alignment, viewing Position Specific Probability Matrix
(PSPM), and finding replications that are similar to the
best motif found [5].

The PSPM is a 4 x l matrix containing 4 rows for the four
nucleotides (A, C, G, T) and l columns, where l is the size
of the motif. Each entry in the matrix is the frequency of
a nucleotide in the multiple alignments of the sequences.
This frequency is represented by a probability value."

5) Results and Discussion, first sentence: "We used five
datasets from ChIP-Seq experiments in Shen et al. [82]
in Table 3 for our motif search." "Motif search" should
be replaced with "motif discovery."

Author's response: 

We have revised this sentence as suggested. Below is the
revised sentence in the subsection Datasets in the
manuscript.

"We used five datasets from ChIP-Seq experiments in
Shen et al. [82] in Table 3 for our motif discovery."

6) Table 14: It's not clear what is meant by "Raw 
PSSM." The second column (matrix type) contains many 
different entries. How can the matrices be compared 
when the matrix type used for comparison is not the 
same? On the other hand, if the matrix type really is the 
same, could this column be left out? 

Author's response: 

We have added a definition for raw PSSM in para- 
graph 5 of the subsection Discussion in the manuscript. 
We also directed the readers to a reference, which con- 
tains an URL of the site explaining this format. The 
change made in this paragraph is below.

"The output of W-ChIPMotifs includes the frequencies
of nucleotides but they are not in the form of matrices.
Thus, we converted these frequencies into raw PSSMs
[84], which were used to compare with the motif results
from other Web tools. Raw PSSM is defined in [84] as
follows. It is an l x 4 matrix containing 4 columns for the
four nucleotides (A, C, G, T) and l rows for the size of the
motif. Each entry in the matrix is the frequency value of
a nucleotide in the multiple sequence alignments. The
matrix is preceded by a ">" character followed by some
characters, which can be the name of the matrix..."

Different matrix types can be compared with each other
using the STAMP tool, as it accepts a wide variety of matrix
formats. This flexibility allowed us to perform the
comparisons presented in the manuscript. We included the
matrix type column in this table to provide details of the
comparisons to the readers. This table has been moved to
the Supplementary Tables file and renamed Supplementary
Table 10.
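
The conversion described in this response can be sketched as below. This is a minimal illustration under stated assumptions: it follows only the description quoted above (an l x 4 matrix with columns A, C, G, T, preceded by a ">" line carrying a name), while the tab delimiter, the function name `to_raw_pssm`, and the example values are hypothetical and are not taken from [84] or from W-ChIPMotifs.

```python
def to_raw_pssm(name, rows):
    """Format per-position (A, C, G, T) frequencies as a '>'-headed l x 4 matrix.

    The tab delimiter is an assumption; the source only describes the layout.
    """
    lines = [">" + name]
    for row in rows:
        if len(row) != 4:
            raise ValueError("each row needs four values: A, C, G, T")
        lines.append("\t".join(str(value) for value in row))
    return "\n".join(lines)


# Hypothetical frequencies for a motif of size l = 3.
example = to_raw_pssm("motif_1", [(10, 0, 0, 0), (0, 8, 2, 0), (1, 1, 1, 7)])
```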

7) Table 15: If I'm not mistaken, "N/A" should be re- 
placed by "0" in Row 2, Column 8, which shows the 
MEME-DREME comparison. 

Author's response: 

It was an error in the table. We have fixed this error
by replacing "N/A" with "0 (0%)" in Row 2, Column 8.
This error has been corrected for all rows and columns
where a result that has zero motifs is compared with
another result that has zero or more motifs.

Quality of written English: Needs some language
corrections before being published.

Second round 
Reviewer's report 1 

Prof. Sandor Pongor, International Centre for Genetic
Engineering and Biotechnology (ICGEB), Italy

Accepted.

Quality of written English: Acceptable.

Reviewer's report 2

Dr. Yuriy Gusev, Georgetown University Medical Center, USA

I am satisfied with the authors' response to my comments
and recommend accepting the manuscript for publication.

Quality of written English: Acceptable.

Reviewer's report 3 

Dr. Shyam Prabhakar (nominated by Prof. Limsoon Wong). 

The revised version fixes some of the issues raised in 
the first round of review. However, my main concern re- 
mains that the study provides no guidance on the quality 
of the predictions made by the various tools. 

In the revised version, the authors have attempted to 
address this point by comparing the de novo predicted 
motifs against databases of known motifs: JASPAR and 
UniProbe. It is claimed that most of the predicted motifs 
exist in the reference databases, and therefore the pre- 
dictions are valid. However, it is not clear if the database 
matching was done correctly. As far as I can tell, motifs 
were considered to have a database match if TOMTOM 
found a hit with raw P-value < 0.01. Because of the mul- 
tiple testing problem (there are hundreds of motifs in 
the JASPAR + UniProbe database), this is actually a very 
loose P-value threshold. It corresponds to a false-discovery 
rate not far from 100%. In other words, even random,
nonsense motifs would match the database at this P-value
threshold. I submitted three random motifs as sample quer- 
ies to TOMTOM: GSTWGR, AGACG and CMAWGT. 
These motifs were plucked out of thin air - as far as I 
know, they do not correspond to any real transcription fac- 
tors. All three returned database matches with P < 0.01. 

If the authors applied a false-discovery rate cutoff 
(TOMTOM q-value < 0.01, for example), it's likely that 
only a fraction of predicted motifs would have database 
matches. This is because the number of motif predictions
is too large - on average each tool predicted 34
motifs in one ChIP-seq dataset (average of Column 3 in
Supplementary Table 11). Only the top few motifs in 
these lists are likely to be genuine. 
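
The multiple-testing argument above can be illustrated with a rough calculation. The database size of 500 and the p-values below are hypothetical, and the Benjamini-Hochberg procedure is used only as a generic stand-in for a false-discovery-rate correction; the q-values that TOMTOM reports may be computed differently.

```python
def expected_false_matches(p_cutoff, num_targets):
    """Expected number of chance matches when one query meets many targets."""
    return p_cutoff * num_targets


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical: one query motif against 500 database motifs at raw P < 0.01.
chance_hits = expected_false_matches(0.01, 500)

# Hypothetical p-values: one strong match among otherwise unremarkable ones.
p_values = [0.00001, 0.004, 0.008, 0.2, 0.5]
q_values = benjamini_hochberg(p_values)
```

Under these assumed numbers, about five database entries would pass the raw P < 0.01 cutoff by chance for every query, which is why a raw P-value cutoff alone says little about whether a match is genuine.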

Author's response: 

We have validated a few motifs using a q-value cutoff <
0.01 in TOMTOM. We found that this q-value cutoff resulted
in losing motifs that we consider significant because these
motifs were reported by multiple tools. Below are some
examples.

Example 1:

This motif below was found by MEME in the dataset
DM05. It is motif number 17 in the list of 46 motifs
reported by MEME.



Motif 17 position-specific probability matrix
letter-probability matrix: alength= 4 w= 11 nsites= 2 E= 1.1e+009
0.000000 1.000000 0.000000 0.000000
0.000000 0.000000 0.000000 1.000000
0.000000 0.000000 0.000000 1.000000
0.000000 1.000000 0.000000 0.000000
0.000000 0.000000 0.000000 1.000000
0.000000 0.000000 1.000000 0.000000
1.000000 0.000000 0.000000 0.000000
0.000000 1.000000 0.000000 0.000000
0.000000 1.000000 0.000000 0.000000
0.000000 0.000000 0.000000 1.000000
0.000000 1.000000 0.000000 0.000000

This motif was also found by GLAM2, CisFinder,
W-ChIPMotifs, MEME-ChIP, peak-motifs, and PScanChIP.
TOMTOM reported several matches for this motif with
p-values < 0.01, but the q-values are much larger than 0.01
(0.188899 or greater at the time of this validation) for
mouse in the JASPAR or UniProbe database. One example of
these matches is Hoxc9_2367.2 (homeo box C9) in the
UniProbe database for mouse, with p-value = 0.00327818 and
q-value = 0.6349. We also validated this motif with STAMP
and found that STAMP reported the same match,
Hoxc9_2367.2 (homeo box C9), in the UniProbe database with
E-value = 3.5754e-02.

Example 2: 

This motif below was found by MEME-ChIP in the 
dataset DM01. It is motif number 9 in the list of total 9 
motifs reported by MEME-ChIP. 



Motif 9 position-specific probability matrix
letter-probability matrix: alength= 4 w= 14 nsites= 40 E= 7.2e-014
0.7    0.05   0      0.25
1      0      0      0
0.5    0.125  0.2    0.175
0.775  0.225  0      0
0.95   0.05   0      0
1      0      0      0
0.875  0      0.075  0.05
0.375  0.575  0.05   0
1      0      0      0
0.9    0.025  0.075  0
0.9    0.075  0.025  0
0.7    0.175  0      0.125
0.775  0      0.15   0.075
0.525  0.3    0.175  0

This motif was also found by CisFinder, peak-motifs,
DREME, and PScanChIP. TOMTOM also reported
several matches for this motif with p-values < 0.01 but
q-values are much larger than 0.01 (at least 0.12092
or greater at the time of this validation) for mouse
species in JASPAR or UniProbe database.

Example 3: 

The motif below was found by peak-motifs in the data- 
set DM254. It is motif number 37 in the list of total 39 
motifs reported by peak-motifs. 



DE  yvTGCyGCCmCCwGgtG
PO  A    C    G    T
1   62   94   67   109
2   84   105  83   60
3   18   13   19   282
4   16   13   294  9
5   5    294  23   10
6   12   89   5    226
7   11   6    309  6
8   5    320  4    3
9   2    327  2    1
10  217  104  3    8
11  4    262  56   10
12  5    243  3    81
13  149  3    1    179
14  17   4    305  6
15  79   21   223  9
16  55   54   76   147
17  24   24   264  20
XX





This motif was also reported by CisFinder, MEME-ChIP,
DREME, and PScanChIP. As above, TOMTOM reported several
matches for this motif with p-values < 0.01 but q-values >
0.01 (0.354709 or greater at the time of this validation)
for mouse in the JASPAR or UniProbe database.

The motifs in the examples above were found by multiple 
tools. TOMTOM found matches for these motifs in either 
JASPAR or UniProbe database for mouse with p-values < 
0.01. We think these motifs are significant and should not 
be eliminated. However, the q-values reported by TOM- 
TOM for these motifs exceed the stringent cutoff 0.01. 
Thus, if we applied this stringent cutoff q-value < 0.01 for 
the motifs in our similarity comparisons it would result in 
losing significant motifs. 

I would suggest that more stringent cutoffs be applied 
at all stages of the analysis. It would probably help quite 
a bit to consider only the top motif predictions, and also 
to run TOMTOM with a q-value threshold rather than 
a P-value threshold. I am not completely clear on how 
STAMPY was applied in this study, but it would be im- 
portant to apply a q-value cutoff there as well. 

Author's response: 

Using more stringent cutoffs can eliminate false
positives. However, these stringent cutoffs can also
eliminate significant motifs. We do not have a suggestion
for the exact cutoff that would balance both cases. Thus,
we leave this value for users to decide as appropriate for
their research.

We used STAMP for finding similar motifs that were 
reported by multiple tools using E-value cutoff < 0.05 in 
our study. The results from STAMP provide similar mo- 
tifs using E-value only. Therefore, we can only use E- 
value for the cutoff. 

Additional file 



Additional file 1: Table S1. Parameters selected for running MEME
motif finding Web tool. Table S2. Parameters used for running GLAM2
motif finding Web tool. Table S3. Parameters used for running
CompleteMOTIFs motif finding Web tool. Table S4. Parameters used for
running CisFinder motif finding Web tool. Table S5. Parameters used
for running DREME motif finding Web tool. Table S6. Parameters used
for running MEME-ChIP motif finding Web tool. Table S7. Parameters
used for running RSAT peak-motifs motif finding Web tool. Table S8.
Parameters used for running PScanChIP motif finding Web tool. Table S9. A
summary of the motif results for each dataset and Web tool. Table S10.
Matrix types for motifs comparisons. Table S11. Comparing motif results
between each motif finding Web tool with other motif finding Web tools
for the number of best matched motifs using E-value < 0.05 for each dataset